Algorithms for disease diagnostics

ABSTRACT

The present invention relates to compositions and methods for molecular profiling and diagnostics for genetic disorders and cancer, including but not limited to gene expression product markers associated with cancer or genetic disorders. In particular, the present invention provides algorithms and methods of classifying cancer, for example, thyroid cancer, methods of determining molecular profiles, and methods of analyzing results to provide a diagnosis.

CROSS REFERENCE

This application is a continuation application of U.S. patent application Ser. No. 15/661,496, filed Jul. 27, 2017; U.S. patent application Ser. No. 15/661,496 is a continuation-in-part application of U.S. patent application Ser. No. 15/274,492, filed Sep. 23, 2016 and now issued as U.S. Pat. No. 9,856,537, which is a continuation application of U.S. patent application Ser. No. 12/964,666, filed Dec. 9, 2010 and now issued as U.S. Pat. No. 9,495,515, which claims the benefit of U.S. Provisional Patent Application No. 61/285,165, filed Dec. 9, 2009, and U.S. patent application Ser. No. 15/661,496 is a continuation-in-part application of U.S. patent application Ser. No. 13/589,022, filed Aug. 17, 2012, which is a continuation application of U.S. patent application Ser. No. 12/592,065, filed Nov. 17, 2009 and now issued as U.S. Pat. No. 8,541,170, which claims the benefit of U.S. Provisional Application No. 61/199,585, filed Nov. 17, 2008, and U.S. Provisional Application No. 61/270,812, filed Jul. 13, 2009; each of which is entirely incorporated herein by reference.

BACKGROUND OF THE INVENTION

Cancer is the second leading cause of death in the United States and one of the leading causes of mortality worldwide. Nearly 25 million people are currently living with cancer, with 11 million new cases diagnosed each year. Furthermore, as the general population continues to age, cancer will become an increasingly serious problem. The World Health Organization projects that by the year 2020, global cancer rates will increase by 50%.

Successful treatment of cancer starts with early and accurate diagnosis. Current methods of diagnosis include cytological examination of tissue samples taken by biopsy or imaging of tissues and organs for evidence of aberrant cellular proliferation. While these techniques have proven to be both useful and inexpensive, they suffer from a number of drawbacks. First, cytological analysis and imaging techniques for cancer diagnosis often require a subjective assessment to determine the likelihood of malignancy. Second, the increased use of these techniques has led to a sharp increase in the number of indeterminate results in which no definitive diagnosis can be made. Third, these routine diagnostic methods lack a rigorous method for determining the probability of an accurate diagnosis. Fourth, these techniques may be incapable of detecting a malignant growth at very early stages. Fifth, these techniques do not provide information regarding the basis of the aberrant cellular proliferation.

Many of the newer generation of treatments for cancer, while exhibiting greatly reduced side effects, are specifically targeted to a certain metabolic or signaling pathway, and will only be effective against cancers that are reliant on that pathway. Further, the cost of any treatments can be prohibitive for an individual, insurance provider, or government entity. This cost could be at least partially offset by improved methods that accurately diagnose cancers and the pathways they rely on at early stages. These improved methods would be useful both for preventing unnecessary therapeutic interventions and for directing treatment.

In the case of thyroid cancer, it is estimated that out of the approximately 130,000 thyroid removal surgeries performed each year due to suspected malignancy in the United States, only about 54,000 are necessary. Thus, approximately 76,000 unnecessary surgeries are performed annually. In addition, there are continued treatment costs and complications due to the need for lifelong drug therapy to replace the lost thyroid function. Accordingly, there is a need for improved testing modalities and business practices that improve upon current methods of cancer diagnosis.

The thyroid has at least two kinds of cells that make hormones. Follicular cells make thyroid hormone, which affects heart rate, body temperature, and energy level. C cells make calcitonin, a hormone that helps control the level of calcium in the blood. Abnormal growth in the thyroid can result in the formation of nodules, which can be either benign or malignant. Thyroid cancer includes at least four different kinds of malignant tumors of the thyroid gland: papillary, follicular, medullary and anaplastic.

SUMMARY OF THE INVENTION

In one embodiment, the invention is a method of diagnosing a genetic disorder or cancer comprising the steps of: (a) obtaining a biological sample comprising gene expression products; (b) detecting the gene expression products of the biological sample; (c) comparing to an amount in a control sample, an amount of one or more gene expression products in the biological sample to determine the differential gene expression product level between the biological sample and the control sample; (d) classifying the biological sample by inputting the one or more differential gene expression product levels to a trained algorithm; wherein technical factor variables are removed from data based on differential gene expression product level and normalized prior to and during classification; and (e) identifying the biological sample as positive for a genetic disorder or cancer if the trained algorithm classifies the sample as positive for the genetic disorder or cancer at a specified confidence level.

In another embodiment, the invention is an algorithm for diagnosing a genetic disorder or cancer comprising: (a) determining the level of gene expression products in a biological sample; (b) deriving the composition of cells in the biological sample based on the expression levels of cell-type specific markers in the sample; (c) removing technical variables prior to and during classification of the biological sample; (d) correcting or normalizing the gene product levels determined in step (a) based on the composition of cells determined in step (b); and (e) classifying the biological sample as positive for a genetic disorder or cancer.

The present invention includes a method for diagnosing thyroid disease in a subject, the method comprising (a) providing a nucleic acid sample from a subject; (b) detecting the amount of one or more genes, gene products, or transcripts selected from the group consisting of the genes or transcripts listed in Tables 2 or their complement; and (c) determining whether said subject has or is likely to have a malignant or benign thyroid condition based on the results of step (b).

The present invention also includes a composition comprising one or more binding agents that specifically bind to the one or more polymorphisms selected from the group consisting of the polymorphisms listed in Tables.

INCORPORATION BY REFERENCE

All publications and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication or patent application was specifically and individually indicated to be incorporated by reference.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features of the invention are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings of which:

FIG. 1 is a table listing 75 thyroid samples examined for gene expression analysis using the Affymetrix Human Exon 10ST array to identify genes that are significantly differentially expressed or alternatively spliced between malignant, benign, and normal samples. The name and the pathological classification of each sample are listed.

FIG. 2 is a table listing the top 100 differentially expressed genes at the gene level. Data are from the dataset in which benign, malignant, and normal thyroid samples were compared at the gene level. Markers were selected based on statistical significance after Benjamini and Hochberg correction for false discovery rate (FDR). Positive numbers denote up-regulation and negative numbers denote down-regulation of expression.

FIG. 3 is a table listing the top 100 alternatively spliced genes. Data are from the dataset in which benign, malignant, and normal thyroid samples were compared at the gene level. Markers were selected based on statistical significance after Benjamini and Hochberg correction for false discovery rate (FDR).

FIG. 4 is a table listing the top 100 differentially expressed genes at the probe-set level. Data were from the Probe-set dataset. Positive numbers denote up-regulation of gene expression, while negative numbers denote down-regulation.

FIG. 5 is a table listing the top 100 significant diagnostic markers determined by gene level analysis. Markers in this list show both differential gene expression and alternative exon splicing. Positive numbers denote up-regulation, while negative numbers denote down-regulation. This table lists three sets of calculated fold-changes for any given marker to allow comparison between the groups malignant versus benign, benign versus normal, and malignant versus normal.

FIG. 6 is a table listing the genes identified as contributing to thyroid cancer diagnosis by molecular profiling of gene expression levels and/or alternative exon splicing. Markers identified from the dataset in which benign, malignant and normal samples were analyzed at the gene level are referred to as BMN in the data source column; likewise, markers identified from the dataset in which the benign and malignant samples were analyzed at the gene level are referred to as BM in the data source column. Similarly, markers identified at the probe-set level from the dataset in which benign and malignant samples were analyzed are referred to as Probe-set in the data source column.

FIG. 7 is a table listing tissue samples examined for gene expression analysis. The samples were classified by pathological analysis as benign (B) or malignant (M). Benign samples were further classified as follicular adenoma (FA), lymphocytic thyroiditis (LCT), or nodular hyperplasia (NHP). Malignant samples were further classified as Hurthle cell carcinoma (HC), follicular carcinoma (FC), follicular variant of papillary thyroid carcinoma (FVPTC), papillary thyroid carcinoma (PTC), medullary thyroid carcinoma (MTC), or anaplastic carcinoma (ATC).

FIG. 8 is a table listing fine needle aspirate samples examined for gene expression analysis. The samples were classified by pathological analysis as benign (B) or malignant (M). Benign samples were further classified as follicular adenoma (FA), lymphocytic thyroiditis (LCT), Hurthle cell adenoma (HA), or nodular hyperplasia (NHP). Malignant samples were further classified as Hurthle cell carcinoma (HC), follicular carcinoma (FC), follicular variant of papillary thyroid carcinoma (FVPTC), papillary thyroid carcinoma (PTC), medullary thyroid carcinoma (MTC), or anaplastic carcinoma (ATC).

FIG. 9 is a table listing genes identified from expression analysis of the tissue samples listed in FIG. 7 which exhibit significant differences in expression between malignant and benign samples as determined by feature selection using LIMMA (linear models for microarray data) and SVM (support vector machine) for classification of malignant vs. benign samples. Rank denotes the marker significance (lower rank, higher significance) after Benjamini and Hochberg correction for False Discovery Rate (FDR). Gene symbol denotes the name of the gene. TCID denotes the transcript cluster ID of the gene used in the Affymetrix Human Exon 10ST array. Ref Seq denotes the name of the corresponding reference sequence for that gene. The column labeled “Newly Discovered Marker” denotes gene expression markers which have not previously been described as differentially expressed in malignant vs. benign thyroid tissues.

FIG. 10 is a table listing genes identified from expression analysis of the tissue samples listed in FIG. 8 which exhibit significant differences in expression between medullary thyroid carcinoma (MTC) and other pathologies as determined by feature selection using LIMMA (linear models for microarray data) and SVM (support vector machine) for classification of MTC vs. other samples. Rank denotes the marker significance (lower rank, higher significance) after Benjamini and Hochberg correction for False Discovery Rate (FDR). Gene symbol denotes the name of the gene. TCID denotes the transcript cluster ID of the gene used in the Affymetrix Human Exon 10ST array. P value indicates the statistical significance of the differential expression between MTC and non-MTC samples. Fold Change indicates the degree of differential expression between MTC and non-MTC samples. The column labeled “Newly Discovered Marker” denotes gene expression markers which have not previously been described as differentially expressed in malignant vs. benign thyroid tissues.

FIG. 11 is a table listing genes identified from expression analysis of the samples listed in FIGS. 7 and 8 which exhibit significant differences in expression between benign and malignant samples as determined by a repeatability based meta-analysis classification algorithm.

FIG. 12 is a table listing genes identified from expression analysis of the samples listed in FIGS. 7 and 8 which exhibit significant (posterior probability >0.9) differences in expression between benign and malignant samples as determined by Bayesian ranking of the differentially expressed genes. The ranking involved deriving type I and type II error rates from previously published studies to determine prior probabilities, combining these prior probabilities with the output of the dataset derived from expression analysis of the samples listed in FIG. 10 to estimate posterior probabilities of differential gene expression, and then combining the results of the expression analysis of the samples listed in FIG. 11 with the estimated posterior probabilities to calculate final posterior probabilities of differential gene expression. These posterior probabilities were then used to rank the differentially expressed genes.

FIG. 13 is a table listing genes identified from expression analysis of the samples listed in FIG. 7 which exhibit differential expression between samples categorized as FA, LCT, NHP, HC, FC, FVPTC, PTC, MTC, or ATC as determined by feature selection using LIMMA (linear models for microarray data) and SVM (support vector machine) for classification.

FIG. 14 is a table listing fine needle aspirate samples examined for micro RNA (miRNA) expression analysis using an Agilent Human v2 miRNA microarray chip. The samples were classified by pathological analysis as benign (B) or malignant (M). Benign samples were further classified as follicular adenoma (FA) or nodular hyperplasia (NHP). Malignant samples were further classified as follicular carcinoma (FC), follicular variant of papillary thyroid carcinoma (FVPTC), papillary thyroid carcinoma (PTC), or medullary thyroid carcinoma (MTC).

FIG. 15 is a table listing fine needle aspirate samples examined for micro RNA (miRNA) expression analysis using an Illumina Human v2 miRNA array. The samples were classified by pathological analysis as benign (B), non diagnostic, or malignant (M). Benign samples were further classified as benign nodule (BN), follicular neoplasm (FN), lymphocytic thyroiditis (LCT), or nodular hyperplasia (NHP). Malignant samples were further classified as follicular variant of papillary thyroid carcinoma (FVPTC) or papillary thyroid carcinoma (PTC).

FIG. 16 is a table listing micro RNAs (miRNAs) identified from analysis of the samples listed in FIG. 14 which exhibit differential expression between samples categorized as benign or malignant. The miRNA column denotes the name of the miRNA. The CHR column denotes the chromosome the miRNA is located on. The P column denotes the statistical confidence or p-value provided by the analysis. The DE column denotes whether the listed miRNA is upregulated (1) in malignant samples or downregulated (−1) in malignant samples. The patent column denotes any patents or applications that describe these miRNAs.

FIG. 17 is a table listing micro RNAs (miRNAs) identified from analysis of the samples listed in FIG. 15 which exhibit differential expression between samples categorized as benign or malignant. The miRNA column denotes the name of the miRNA. The probe ID column denotes the corresponding probe ID in the Illumina array. The CHR column denotes the chromosome the miRNA is located on. The P column denotes the statistical confidence or p-value provided by the analysis. The DE column denotes whether the listed miRNA is upregulated (no sign) in malignant samples or downregulated (negative sign) in malignant samples. The Rep column denotes the repeatability score provided by a “hot probes” type analysis of the hybridization data. The patent column denotes any patents or applications that describe these miRNAs.

FIG. 18 is a flow chart describing how molecular profiling may be used to improve the accuracy of routine cytological examination. FIG. 18A and FIG. 18B describe alternate embodiments of the molecular profiling business.

FIG. 19 is an illustration of a kit provided by the molecular profiling business.

FIG. 20 is an illustration of a molecular profiling results report.

FIG. 21 depicts a computer useful for displaying, storing, retrieving, or calculating diagnostic results from the molecular profiling; displaying, storing, retrieving, or calculating raw data from genomic or nucleic acid expression analysis; or displaying, storing, retrieving, or calculating any sample or customer information useful in the methods of the present invention.

FIG. 22 depicts a titration curve of error rate vs. number of genes using an SVM-based classification algorithm. The titration curve plateaus when the classification algorithm examines 200-250 genes. These data indicate that the overall error rate of the current algorithm was 4% (5/138).

FIG. 23 shows an example of technical factor effects using variance decomposition analysis.

FIG. 24 shows that P-values improve after technical factor removal. In this example the technical factor removed was sample collection fluid. Two chemically distinct fluids were used for sample preservation at the time of collection in the clinic. This technical factor obscured the biological signal present during standard analysis. The 45-degree line in the graph indicates that the p-values calculated from the raw data (log10 scale) were lower than the p-values calculated after the technical factor removal method was used on the same dataset. P-values become more significant with technical factor removal. Here, the technical factor is collection fluid. A large number of samples came from another collection fluid, and that effect obscures the biological signal present in the markers.
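The descriptions of FIGS. 2, 3, 9, and 10 above refer to Benjamini and Hochberg correction for false discovery rate. As an illustration only, the standard step-up adjustment can be sketched as follows. The function name and the example p-values are hypothetical and are not taken from the disclosed datasets; this is a minimal sketch of the general procedure, not the specific implementation used to generate the figures.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four markers (illustrative only).
example_p = [0.01, 0.04, 0.03, 0.005]
adjusted_p = benjamini_hochberg(example_p)
# Markers whose adjusted p-value falls at or below an assumed 5% FDR cutoff.
significant = [q <= 0.05 for q in adjusted_p]
```

Markers would then be ranked by adjusted p-value, with a lower rank indicating higher significance, consistent with the Rank column described for FIGS. 9 and 10.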

DETAILED DESCRIPTION OF THE INVENTION

I. Introduction

The present disclosure provides novel methods for diagnosing abnormal cellular proliferation from a biological test sample, and related kits and compositions. The present invention also provides methods and compositions for differential diagnosis of types of aberrant cellular proliferation such as carcinomas including follicular carcinomas (FC), follicular variant of papillary thyroid carcinomas (FVPTC), Hurthle cell carcinomas (HC), Hurthle cell adenomas (HA), papillary thyroid carcinomas (PTC), medullary thyroid carcinomas (MTC), and anaplastic carcinomas (ATC); adenomas including follicular adenomas (FA); nodular hyperplasias (NHP); colloid nodules (CN); benign nodules (BN); follicular neoplasms (FN); lymphocytic thyroiditis (LCT), including lymphocytic autoimmune thyroiditis; parathyroid tissue; renal carcinoma metastasis to the thyroid; melanoma metastasis to the thyroid; B-cell lymphoma metastasis to the thyroid; breast carcinoma metastasis to the thyroid; benign (B) tumors, malignant (M) tumors, and normal (N) tissues. The present invention further provides novel markers including microRNAs (miRNAs) and gene expression product markers and novel groups of genes and markers useful for the diagnosis, characterization, and treatment of cellular proliferation. Additionally, the present invention provides business methods for providing enhanced diagnosis, differential diagnosis, monitoring, and treatment of cellular proliferation.

Cancer is a leading cause of death in the United States. Early and accurate diagnosis of cancer is critical for effective management of this disease. It is therefore important to develop testing modalities and business practices to enable cancer diagnosis that is more accurate and earlier. Expression product profiling, also referred to as molecular profiling, provides a powerful method for early and accurate diagnosis of tumors or other types of cancers from a biological sample.

Typically, screening for the presence of a tumor or other type of cancer involves analyzing a biological sample taken by various methods such as, for example, a biopsy. The biological sample is then prepared and examined by one skilled in the art. The methods of preparation can include but are not limited to various cytological stains and immuno-histochemical methods. Unfortunately, traditional methods of cancer diagnosis suffer from a number of deficiencies. These deficiencies include: 1) the diagnosis may require a subjective assessment and thus be prone to inaccuracy and lack of reproducibility, 2) the methods may fail to determine the underlying genetic, metabolic or signaling pathways responsible for the resulting pathogenesis, 3) the methods may not provide a quantitative assessment of the test results, and 4) the methods may be unable to provide an unambiguous diagnosis for certain samples.

One hallmark of cancer is dysregulation of normal transcriptional control leading to aberrant expression of genes or other RNA transcripts such as miRNAs. Among the aberrantly expressed transcripts are genes involved in cellular transformation, for example tumor suppressors and oncogenes. Tumor suppressor genes and oncogenes may be up-regulated or down-regulated in tumors when compared to normal tissues. Known tumor suppressors and oncogenes include, but are not limited to, brca1, brca2, bcr-abl, bcl-2, HER2, N-myc, C-myc, BRAF, RET, Ras, KIT, Jun, Fos, and p53. This abnormal expression may occur through a variety of different mechanisms. It is not necessary in the present invention to understand the mechanism of aberrant expression, or the mechanism by which carcinogenesis occurs. Nevertheless, finding a marker or set of markers whose expression is up- or down-regulated in a sample as compared to a normal sample may be indicative of cancer. Furthermore, the particular aberrantly expressed markers or set of markers may be indicative of a particular type of cancer, or even a recommended treatment protocol. Additionally, the methods of the present invention are not meant to be limited solely to canonically defined tumor suppressors or oncogenes. Rather, it is understood that any marker, gene or set of genes or markers that is determined to have a statistically significant correlation with respect to expression level or alternative gene splicing to a benign, malignant, or normal diagnosis is encompassed by the present invention.

In one embodiment, the methods of the present invention seek to improve upon the accuracy of current methods of cancer diagnosis. Improved accuracy can result from the measurement of multiple genes and/or expression markers, the identification of gene expression products such as miRNAs, rRNA, tRNA and mRNA gene expression products with high diagnostic power or statistical significance, or the identification of groups of genes and/or expression products with high diagnostic power or statistical significance, or any combination thereof.

For example, increased expression of a number of receptor tyrosine kinases has been implicated in carcinogenesis. Measurement of the gene expression product level of a single receptor tyrosine kinase known to be differentially expressed in cancer cells may provide incorrect diagnostic results leading to a low accuracy rate. Measurement of a plurality of receptor tyrosine kinases may increase the accuracy level by requiring a combination of alternately expressed genes to occur. In some cases, measurement of a plurality of genes might therefore increase the accuracy of a diagnosis by reducing the likelihood that a sample may exhibit an aberrant gene expression profile by random chance.
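The effect of requiring several markers to be aberrant at once can be illustrated with a simple calculation. The sketch below assumes, purely for illustration, that each marker appears aberrant by chance at the same rate and that markers behave independently; real markers are often correlated, so the numbers are an idealized bound rather than a property of the disclosed markers. The function name and the 5% rate are hypothetical.

```python
def chance_of_all_aberrant(per_marker_rate, n_markers):
    """Probability that n independent markers all appear aberrant by chance."""
    if not 0.0 <= per_marker_rate <= 1.0:
        raise ValueError("per_marker_rate must lie between 0 and 1")
    if n_markers < 1:
        raise ValueError("n_markers must be at least 1")
    # Independence assumption: joint probability is the product of the rates.
    return per_marker_rate ** n_markers


# Assumed 5% chance rate per marker (illustrative only).
single_marker = chance_of_all_aberrant(0.05, 1)
three_markers = chance_of_all_aberrant(0.05, 3)
```

Under these assumptions, a profile that requires three markers is far less likely to arise by chance than a profile that requires one.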

Similarly, some gene expression products within a group such as receptor tyrosine kinases may be indicative of a disease or condition when their expression levels are higher or lower than normal. The measurement of expression levels of other gene products within that same group may provide diagnostic utility. Thus, in one embodiment, the invention measures two or more gene expression products that are within a group. For example, in some embodiments, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45 or 50 gene expression products are measured from a group. Various groups are defined within the specification, such as groups useful for diagnosis of subtypes of thyroid cancer or groups of gene expression products that fall within particular ontology groups. In another embodiment, it would be advantageous to measure the expression levels of sets of genes that accurately indicate the presence or absence of cancer from multiple groups. For example, the invention contemplates the use of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45 or 50 gene expression groups, each with 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45 or 50 gene expression products measured.

Additionally, increased expression of other oncogenes such as, for example, Ras in a biological sample may also be indicative of the presence of cancerous cells. In some cases, it may be advantageous to determine the expression level of several different classes of oncogenes such as, for example, receptor tyrosine kinases, cytoplasmic tyrosine kinases, GTPases, serine/threonine kinases, lipid kinases, mitogens, growth factors, and transcription factors. The determination of expression levels and/or exon usage of different classes or groups of genes involved in cancer progression may in some cases increase the diagnostic power of the present invention.

Groups of expression markers may include markers within a metabolic or signaling pathway, or genetically or functionally homologous markers. For example, one group of markers may include genes involved in the epithelial growth factor signaling pathway. Another group of markers may include mitogen-activated protein kinases. The present invention also provides methods and compositions for detecting (i.e., measuring) gene expression markers from multiple and/or independent metabolic or signaling pathways.

In one embodiment, expression product markers of the present invention may provide increased accuracy of cancer diagnosis through the use of multiple expression product markers and statistical analysis. In particular, the present invention provides, but is not limited to, RNA expression profiles associated with thyroid cancers. The present invention also provides methods of characterizing thyroid tissue samples, and kits and compositions useful for the application of said methods. The disclosure further includes methods for running a molecular profiling business.

The present disclosure provides methods and compositions for improving upon the current state of the art for diagnosing cancer.

In some embodiments, the present invention provides a method of diagnosing cancer comprising the steps of: obtaining a biological sample comprising gene expression products; determining the expression level for one or more gene expression products of the biological sample; and identifying the biological sample as cancerous wherein the gene expression level is indicative of the presence of thyroid cancer in the biological sample. This can be done by correlating the gene expression levels with the presence of thyroid cancer in the biological sample. In one embodiment, the gene expression products are selected from FIG. 6. In some embodiments, the method further comprises the step of comparing the expression level of the one or more gene expression products to a control expression level for each gene expression product in a control sample, wherein the biological sample is identified as cancerous if there is a difference in the gene expression level between a gene expression product in the biological sample and the control sample.

In some embodiments, the present invention provides a method of diagnosing cancer comprising the steps of: obtaining a biological sample comprising alternatively spliced gene expression products; determining the expression level for one or more gene expression products of the biological sample; and identifying the biological sample as cancerous wherein the gene expression level is indicative of the presence of thyroid cancer in the biological sample. This can be done by correlating the gene expression levels with the presence of thyroid cancer in the biological sample. In one embodiment, the alternatively spliced gene expression products are selected from FIG. 6, wherein the differential gene expression product alternative exon usage is compared between the biological sample and a control sample; and identifying the biological sample as cancerous if there is a difference in gene expression product alternative exon usage between the biological sample and the control sample at a specified confidence level. In some embodiments, the genes selected from FIG. 6 are further selected from genes listed in FIG. 2, FIG. 3, FIG. 4, or FIG. 5.

In some embodiments, the present invention provides a method of diagnosing cancer that gives a specificity or sensitivity that is greater than 70% using the subject methods described herein, wherein the gene expression product levels are compared between the biological sample and a control sample; and identifying the biological sample as cancerous if there is a difference in the gene expression levels between the biological sample and the control sample at a specified confidence level. In some embodiments, the specificity and/or sensitivity of the present method is at least 70%, 75%, 80%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or more.

In some embodiments, the nominal specificity is greater than or equal to 70%. The nominal negative predictive value (NPV) is greater than or equal to 95%. In some embodiments, the NPV is at least 95%, 95.5%, 96%, 96.5%, 97%, 97.5%, 98%, 98.5%, 99%, 99.5% or more.

Sensitivity typically refers to TP/(TP+FN), where TP is true positive and FN is false negative. Here, it is the number of Continued Indeterminate results divided by the total number of malignant results based on adjudicated histopathology diagnosis. Specificity typically refers to TN/(TN+FP), where TN is true negative and FP is false positive. Here, it is the number of benign results divided by the total number of benign results based on adjudicated histopathology diagnosis. Positive Predictive Value (PPV) is TP/(TP+FP); Negative Predictive Value (NPV) is TN/(TN+FN).
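The four definitions above can be computed directly from the counts of true and false positives and negatives. The following is a minimal sketch; the function name and the example counts are hypothetical and do not correspond to any dataset in this disclosure.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity, PPV and NPV from confusion counts.

    Each denominator is assumed to be non-zero, i.e. the evaluation set
    contains positive and negative cases and calls of both kinds.
    """
    return {
        "sensitivity": tp / (tp + fn),  # TP/(TP+FN)
        "specificity": tn / (tn + fp),  # TN/(TN+FP)
        "ppv": tp / (tp + fp),          # TP/(TP+FP)
        "npv": tn / (tn + fn),          # TN/(TN+FN)
    }


# Hypothetical counts: 50 malignant and 100 benign cases by histopathology.
metrics = diagnostic_metrics(tp=45, fp=10, tn=90, fn=5)
```

With these illustrative counts the sensitivity and specificity are both 90%, while the NPV (90/95) falls just short of the 95% nominal NPV mentioned above, which shows that the two targets are not interchangeable.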

Marker panels are chosen to accommodate adequate separation of benign from non-benign expression profiles. Training of this multi-dimensional classifier, i.e., algorithm, was performed on over 500 thyroid samples, including >300 thyroid FNAs. Many training/test sets were used to develop the preliminary algorithm. An exemplary data set is shown in FIG. 22. First the overall algorithm error rate is shown as a function of gene number for benign vs non-benign samples. All results are obtained using a support vector machine model which is trained and tested in a cross-validated mode (30-fold) on the samples.
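The shape of an error-rate-versus-gene-number titration can be illustrated with a small, self-contained sketch. To avoid external dependencies, a nearest-centroid classifier is used here as a stand-in for the support vector machine, and the data are synthetic; all function names, sample sizes, and parameters are hypothetical. The genes are assumed to be pre-ranked, whereas a real analysis would rank them inside each fold to avoid optimistic bias.

```python
import random


def make_synthetic_data(n_samples, n_genes, n_informative, shift, seed):
    """Build labelled toy expression data; only the first genes carry signal."""
    rng = random.Random(seed)
    x, y = [], []
    for i in range(n_samples):
        label = i % 2  # 0 = benign, 1 = non-benign (hypothetical coding)
        row = [rng.gauss(shift * label if g < n_informative else 0.0, 1.0)
               for g in range(n_genes)]
        x.append(row)
        y.append(label)
    return x, y


def nearest_centroid_predict(train_x, train_y, sample):
    """Assign the label whose class centroid is closest to the sample."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, lab in zip(train_x, train_y) if lab == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(sample, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cv_error_rate(x, y, n_genes, n_folds):
    """k-fold cross-validated error using only the first n_genes features."""
    errors = 0
    for fold in range(n_folds):
        test_idx = [i for i in range(len(x)) if i % n_folds == fold]
        train_idx = [i for i in range(len(x)) if i % n_folds != fold]
        train_x = [x[i][:n_genes] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for i in test_idx:
            if nearest_centroid_predict(train_x, train_y, x[i][:n_genes]) != y[i]:
                errors += 1
    return errors / len(x)


# 60 synthetic samples, 20 genes, the first 5 of which are informative.
x, y = make_synthetic_data(n_samples=60, n_genes=20, n_informative=5,
                           shift=4.0, seed=0)
# Error rate as a function of the number of genes, with 30-fold validation.
error_curve = {k: cv_error_rate(x, y, n_genes=k, n_folds=30)
               for k in (1, 2, 5, 10, 20)}
```

Plotting `error_curve` against the number of genes gives a toy analogue of the titration curve of FIG. 22; the plateau position in the real data (200-250 genes) depends on the actual markers and cannot be inferred from this sketch.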

In some embodiments, the difference in gene expression level is at least10%, 15%, 20%, 25%, 30%, 35%, 40%, 45% or 50% or more. In someembodiments, the difference in gene expression level is at least 2, 3,4, 5, 6, 7, 8, 9, 10 fold or more. In some embodiments, the biologicalsample is identified as cancerous with an accuracy of greater than 75%,80%, 85%, 90%, 95%, 99% or more. In some embodiments, the biologicalsample is identified as cancerous with a sensitivity of greater than95%. In some embodiments, the biological sample is identified ascancerous with a specificity of greater than 95%. In some embodiments,the biological sample is identified as cancerous with a sensitivity ofgreater than 95% and a specificity of greater than 95%. In someembodiments, the accuracy is calculated using a trained algorithm.
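A percent difference and a fold difference between a sample and a control can be computed as sketched below. The sketch assumes strictly positive expression levels on a linear (not log) scale and uses a signed fold-change convention in which negative values denote down-regulation, matching the convention described for the figures; the function names and example values are hypothetical.

```python
def percent_difference(sample_level, control_level):
    """Absolute difference from the control, as a percentage of the control."""
    return abs(sample_level - control_level) / control_level * 100.0


def fold_change(sample_level, control_level):
    """Signed fold change: positive for up-regulation, negative for down."""
    if sample_level >= control_level:
        return sample_level / control_level
    return -(control_level / sample_level)


# Hypothetical linear-scale expression levels (illustrative only).
up_fold = fold_change(800.0, 200.0)          # 4-fold up-regulation
down_fold = fold_change(50.0, 200.0)         # 4-fold down-regulation
small_shift = percent_difference(230.0, 200.0)  # about a 15% difference
# A marker passes an assumed 2-fold threshold if its magnitude is >= 2.
passes_two_fold = abs(down_fold) >= 2.0
```

Comparing the magnitude of the signed fold change against a threshold treats up- and down-regulation symmetrically.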

In some embodiments, the present invention provides gene expression products corresponding to genes selected from Table 3, Table 4 and/or Table 5.

In some embodiments, the present invention provides a method of diagnosing cancer comprising using gene expression products from one or more of the following signaling pathways. The signaling pathways from which the genes can be selected include but are not limited to: acute myeloid leukemia signaling, somatostatin receptor 2 signaling, cAMP-mediated signaling, cell cycle and DNA damage checkpoint signaling, G-protein coupled receptor signaling, integrin signaling, melanoma cell signaling, relaxin signaling, and thyroid cancer signaling. In some embodiments, more than one gene is selected from a single signaling pathway to determine and compare the differential gene expression product level between the biological sample and a control sample. Other signaling pathways include, but are not limited to, an adherens, ECM, thyroid cancer, focal adhesion, apoptosis, p53, tight junction, TGFbeta, ErbB, Wnt, pathways in cancer overview, cell cycle, VEGF, Jak/STAT, MAPK, PPAR, mTOR or autoimmune thyroid pathway. In other embodiments, at least two genes are selected from at least two different signaling pathways to determine and compare the differential gene expression product level between the biological sample and the control sample. Methods and compositions of the invention can have genes selected from 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50 or more signaling pathways and can have from 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50 or more gene expression products from each signaling pathway, in any combination. In some embodiments, the set of genes combined gives a specificity or sensitivity of greater than 70%, 75%, 80%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 99.5%, or a positive predictive value or negative predictive value of at least 95%, 95.5%, 96%, 96.5%, 97%, 97.5%, 98%, 98.5%, 99%, 99.5% or more.

In some embodiments, the present invention provides a method of diagnosing cancer comprising genes selected from at least two different ontology groups. In some embodiments, the ontology groups from which the genes can be selected include but are not limited to: cell aging, cell cortex, cell cycle, cell death/apoptosis, cell differentiation, cell division, cell junction, cell migration, cell morphogenesis, cell motion, cell projection, cell proliferation, cell recognition, cell soma, cell surface, cell surface linked receptor signal transduction, cell adhesion, transcription, immune response, or inflammation. In some embodiments, more than one gene is selected from a single ontology group to determine and compare the differential gene expression product level between the biological sample and a control sample. In other embodiments, at least two genes are selected from at least two different ontology groups to determine and compare the differential gene expression product level between the biological sample and the control sample. Methods and compositions of the invention can have genes selected from 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50 or more gene ontology groups and can have from 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50 or more gene expression products from each gene ontology group, in any combination. In some embodiments, the set of genes combined gives a specificity or sensitivity of greater than 70%, 75%, 80%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 99.5%, or a positive predictive value or negative predictive value of at least 95%, 95.5%, 96%, 96.5%, 97%, 97.5%, 98%, 98.5%, 99%, 99.5% or more.

In some embodiments, the present invention provides a method of classifying cancer comprising the steps of: obtaining a biological sample comprising gene expression products; determining the expression level for one or more gene expression products of the biological sample that are differentially expressed in different subtypes of a cancer; and identifying the biological sample as cancerous wherein the gene expression level is indicative for a subtype of cancer. In some embodiments, the method further comprises the step of comparing the expression level of the one or more gene expression products to a control expression level for each gene expression product in a control sample, wherein the biological sample is identified as cancerous if there is a difference in the gene expression level between a gene expression product in the biological sample and the control sample. In some embodiments, the subject methods distinguish follicular carcinoma from medullary carcinoma. In some embodiments, the subject methods distinguish a benign thyroid disease from a malignant thyroid tumor/carcinoma.

In some embodiments, the gene expression product of the subject methods is a protein, and the amount of protein is compared. The amount of protein can be determined by one or more of the following: ELISA, mass spectrometry, blotting, or immunohistochemistry. RNA can be measured by one or more of the following: microarray, SAGE, blotting, RT-PCR, or quantitative PCR.

In some embodiments, the difference in gene expression level, for example, mRNA, protein, or alternatively spliced gene product, between a biological sample and a control sample that can be used to diagnose cancer is at least 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, 10 fold or more.
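A fold difference between a sample and a control can be computed as in the sketch below. The helper names, the symmetric treatment of increases and decreases, and the default threshold of 2 fold are assumptions made for illustration; the text above lists a range of possible thresholds without prescribing one.

```python
def fold_change(sample_level, control_level):
    """Return (fold difference >= 1, direction) for two positive expression levels."""
    if sample_level <= 0 or control_level <= 0:
        raise ValueError("expression levels must be positive")
    ratio = sample_level / control_level
    if ratio >= 1:
        return ratio, "up"
    # Report decreases as a fold value above 1 so both directions are comparable.
    return 1 / ratio, "down"


def exceeds_threshold(sample_level, control_level, min_fold=2.0):
    """True when the fold difference meets the chosen (hypothetical) threshold."""
    fold, _direction = fold_change(sample_level, control_level)
    return fold >= min_fold
```

For example, `fold_change(300, 100)` reports a 3 fold increase, while `exceeds_threshold(120, 100, min_fold=1.5)` is false because the difference is only 1.2 fold.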

In some embodiments, the biological sample is classified as cancerous or positive for a subtype of cancer with an accuracy of greater than 75%, 80%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 99.5%. The diagnosis accuracy as used herein includes specificity, sensitivity, positive predictive value, negative predictive value, and/or false discovery rate.

When classifying a biological sample for diagnosis of cancer, there are typically four possible outcomes from a binary classifier. If the outcome from a prediction is p and the actual value is also p, then it is called a true positive (TP); however, if the actual value is n, then it is said to be a false positive (FP). Conversely, a true negative has occurred when both the prediction outcome and the actual value are n, and a false negative is when the prediction outcome is n while the actual value is p. In one embodiment, consider a diagnostic test that seeks to determine whether a person has a certain disease. A false positive in this case occurs when the person tests positive, but actually does not have the disease. A false negative, on the other hand, occurs when the person tests negative, suggesting they are healthy, when they actually do have the disease. In some embodiments, a ROC curve assuming real-world prevalence of subtypes can be generated by re-sampling errors achieved on available samples in relevant proportions.
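The four outcomes can be tallied from paired predictions and actual values as in the following sketch. The labels "p" and "n" follow the text above; the function name and the example lists are hypothetical.

```python
def tally_outcomes(predicted, actual, positive="p"):
    """Count TP, FP, TN and FN for paired predicted and actual labels."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for pred, act in zip(predicted, actual):
        if pred == positive:
            # Predicted positive: correct only if the actual value is positive.
            counts["TP" if act == positive else "FP"] += 1
        else:
            # Predicted negative: correct only if the actual value is negative.
            counts["FN" if act == positive else "TN"] += 1
    return counts


# Invented example: six samples with classifier calls and true status.
predicted = ["p", "p", "n", "n", "p", "n"]
actual = ["p", "n", "n", "p", "p", "n"]
outcome_counts = tally_outcomes(predicted, actual)
```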

The positive predictive value (PPV), or precision rate, or post-test probability of disease, is the proportion of patients with positive test results who are correctly diagnosed. It is the most important measure of a diagnostic method as it reflects the probability that a positive test reflects the underlying condition being tested for. Its value does, however, depend on the prevalence of the disease, which may vary. In the following relations, FP denotes false positive, TN true negative, TP true positive, and FN false negative.

False positive rate (α)=FP/(FP+TN)=1−specificity

False negative rate (β)=FN/(TP+FN)=1−sensitivity

Power=sensitivity=1−β

Likelihood-ratio positive=sensitivity/(1−specificity)

Likelihood-ratio negative=(1−sensitivity)/specificity
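The relations above can be evaluated together from one set of confusion counts. The sketch below reads the false positive and false negative rates as the complements of specificity and sensitivity, respectively; the function name and the handling of the degenerate cases (returning infinity) are assumptions for illustration.

```python
def error_rates_and_likelihood_ratios(tp, fp, tn, fn):
    """Compute alpha, beta, power and both likelihood ratios from counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    alpha = fp / (fp + tn)  # false positive rate, equal to 1 - specificity
    beta = fn / (tp + fn)   # false negative rate, equal to 1 - sensitivity
    return {
        "alpha": alpha,
        "beta": beta,
        "power": 1 - beta,  # power equals sensitivity
        # Guard the divisions: a perfect specificity or a zero specificity
        # would otherwise divide by zero.
        "lr_positive": (sensitivity / (1 - specificity)
                        if specificity < 1 else float("inf")),
        "lr_negative": ((1 - sensitivity) / specificity
                        if specificity > 0 else float("inf")),
    }


# Hypothetical counts, for illustration only.
rates = error_rates_and_likelihood_ratios(tp=90, fp=20, tn=80, fn=10)
```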

The negative predictive value is the proportion of patients with negative test results who are correctly diagnosed. PPV and NPV measurements can be derived using appropriate disease subtype prevalence estimates. An estimate of the pooled malignant disease prevalence can be calculated from the pool of indeterminates, which roughly classify into benign (B) vs malignant (M) by surgery. For subtype-specific estimates, in some embodiments, disease prevalence may sometimes be incalculable because there are not any available samples. In these cases, the subtype disease prevalence can be substituted by the pooled disease prevalence estimate.
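One common way to derive PPV and NPV from a prevalence estimate is Bayes' rule applied to sensitivity and specificity, sketched below together with the pooled-prevalence fallback described above. The function names and numeric inputs are hypothetical, and the sketch assumes the inputs do not make a denominator zero.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    tp = sensitivity * prevalence              # expected true positive fraction
    fn = (1 - sensitivity) * prevalence        # expected false negative fraction
    tn = specificity * (1 - prevalence)        # expected true negative fraction
    fp = (1 - specificity) * (1 - prevalence)  # expected false positive fraction
    return tp / (tp + fp), tn / (tn + fn)


def subtype_prevalence(subtype_count, total_count, pooled_prevalence):
    """Use the pooled estimate when no samples exist to estimate the subtype."""
    if total_count == 0:
        return pooled_prevalence
    return subtype_count / total_count


# Hypothetical values: no samples for this subtype, pooled prevalence of 25%.
prevalence = subtype_prevalence(0, 0, pooled_prevalence=0.25)
ppv, npv = predictive_values(sensitivity=0.9, specificity=0.8,
                             prevalence=prevalence)
```

With these invented inputs the sketch gives a PPV of 0.6 and an NPV of 0.96, which illustrates how a moderate prevalence can leave NPV high while PPV stays modest.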

In some embodiments, the level of expression products or alternative exon usage is indicative of one of the following: follicular cell carcinoma, anaplastic carcinoma, medullary carcinoma, or sarcoma. In some embodiments, the one or more genes selected using the methods of the present invention for diagnosing cancer contain representative sequences corresponding to a set of metabolic or signaling pathways indicative of cancer.

In some embodiments, the results of the expression analysis of the subject methods provide a statistical confidence level that a given diagnosis is correct. In some embodiments, such statistical confidence level is above 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99% or 99.5%.

In another aspect, the present invention provides a composition for diagnosing cancer comprising oligonucleotides comprising a portion of one or more of the genes listed in FIG. 6 or their complement, and a substrate upon which the oligonucleotides are covalently attached. The composition of the present invention is suitable for use in diagnosing cancer at a specified confidence level using a trained algorithm. In one example, the composition of the present invention is used to diagnose thyroid cancer.

In one aspect of the present disclosure, samples that have been processed by a cytological company, subjected to routine methods and stains, diagnosed and categorized, are then subjected to molecular profiling as a second diagnostic screen. This second diagnostic screen enables: 1) a significant reduction of false positives and false negatives, 2) a determination of the underlying genetic, metabolic, or signaling pathways responsible for the resulting pathology, 3) the ability to assign a statistical probability to the accuracy of the diagnosis, 4) the ability to resolve ambiguous results, and 5) the ability to distinguish between sub-types of cancer.

For example, in the specific case of thyroid cancer, molecular profiling of the present invention may further provide a diagnosis for the specific type of thyroid cancer (e.g. papillary, follicular, medullary, or anaplastic). The results of the molecular profiling may further allow one skilled in the art, such as a scientist or medical professional, to suggest or prescribe a specific therapeutic intervention. Molecular profiling of biological samples may also be used to monitor the efficacy of a particular treatment after the initial diagnosis. It is further understood that in some cases, molecular profiling may be used in place of, rather than in addition to, established methods of cancer diagnosis.

In one aspect, the present invention provides algorithms and methods that can be used for diagnosis and monitoring of a genetic disorder. A genetic disorder is an illness caused by abnormalities in genes or chromosomes. While some diseases, such as cancer, are due in part to genetic disorders, they can also be caused by environmental factors. In some embodiments, the algorithms and the methods disclosed herein are used for diagnosis and monitoring of a cancer such as thyroid cancer.

Genetic disorders can typically be grouped into two categories: single gene disorders, and multifactorial and polygenic (complex) disorders. A single gene disorder is the result of a single mutated gene. There are estimated to be over 4000 human diseases caused by single gene defects. Single gene disorders can be passed on to subsequent generations in several ways. Modes of inheritance for a single gene disorder include but are not limited to autosomal dominant, autosomal recessive, X-linked dominant, X-linked recessive, Y-linked and mitochondrial inheritance. Only one mutated copy of the gene will be necessary for a person to be affected by an autosomal dominant disorder. Examples of autosomal dominant type of disorder include but are not limited to Huntington's disease, Neurofibromatosis 1, Marfan Syndrome, Hereditary nonpolyposis colorectal cancer, and Hereditary multiple exostoses. In an autosomal recessive disorder, two copies of the gene must be mutated for a person to be affected. Examples of this type of disorder include but are not limited to cystic fibrosis, sickle-cell disease (also partial sickle-cell disease), Tay-Sachs disease, Niemann-Pick disease, spinal muscular atrophy, and dry earwax. X-linked dominant disorders are caused by mutations in genes on the X chromosome. Only a few disorders have this inheritance pattern, with a prime example being X-linked hypophosphatemic rickets. Males and females are both affected in these disorders, with males typically being more severely affected than females. Some X-linked dominant conditions such as Rett syndrome, Incontinentia Pigmenti type 2 and Aicardi Syndrome are usually fatal in males either in utero or shortly after birth, and are therefore predominantly seen in females. X-linked recessive disorders are also caused by mutations in genes on the X chromosome. Examples of this type of disorder include but are not limited to Hemophilia A, Duchenne muscular dystrophy, red-green color blindness, muscular dystrophy and Androgenetic alopecia. Y-linked disorders are caused by mutations on the Y chromosome. Examples include but are not limited to Male Infertility and hypertrichosis pinnae. Mitochondrial inheritance, also known as maternal inheritance, applies to genes in mitochondrial DNA. An example of this type of disorder is Leber's Hereditary Optic Neuropathy.

Genetic disorders may also be complex, multifactorial or polygenic; this means that they are likely associated with the effects of multiple genes in combination with lifestyle and environmental factors. Although complex disorders often cluster in families, they do not have a clear-cut pattern of inheritance. This makes it difficult to determine a person's risk of inheriting or passing on these disorders. Complex disorders are also difficult to study and treat because the specific factors that cause most of these disorders have not yet been identified. Multifactorial or polygenic disorders that can be diagnosed, characterized and/or monitored using the algorithms and methods of the present invention include but are not limited to heart disease, diabetes, asthma, autism, autoimmune diseases such as multiple sclerosis, cancers, ciliopathies, cleft palate, hypertension, inflammatory bowel disease, mental retardation and obesity.

Other genetic disorders that can be diagnosed, characterized and/or monitored using the algorithms and methods of the present invention include but are not limited to 1p36 deletion syndrome, 21-hydroxylase deficiency, 22q11.2 deletion syndrome, 47,XYY syndrome, 48, XXXX, 49, XXXXX, aceruloplasminemia, achondrogenesis, type II, achondroplasia, acute intermittent porphyria, adenylosuccinate lyase deficiency, Adrenoleukodystrophy, ALA deficiency porphyria, ALA dehydratase deficiency, Alexander disease, alkaptonuria, alpha-1 antitrypsin deficiency, Alstrom syndrome, Alzheimer's disease (type 1, 2, 3, and 4), Amelogenesis Imperfecta, amyotrophic lateral sclerosis, Amyotrophic lateral sclerosis type 2, Amyotrophic lateral sclerosis type 4, amyotrophic lateral sclerosis type 4, androgen insensitivity syndrome, Anemia, Angelman syndrome, Apert syndrome, ataxia-telangiectasia, Beare-Stevenson cutis gyrata syndrome, Benjamin syndrome, beta thalassemia, biotinidase deficiency, Birt-Hogg-Dubé syndrome, bladder cancer, Bloom syndrome, Bone diseases, breast cancer, CADASIL, Camptomelic dysplasia, Canavan disease, Cancer, Celiac Disease, CGD Chronic Granulomatous Disorder, Charcot-Marie-Tooth disease, Charcot-Marie-Tooth disease Type 1, Charcot-Marie-Tooth disease Type 4, Charcot-Marie-Tooth disease, type 2, Charcot-Marie-Tooth disease, type 4, Cockayne syndrome, Coffin-Lowry syndrome, collagenopathy, types II and XI, Colorectal Cancer, Congenital absence of the vas deferens, congenital bilateral absence of vas deferens, congenital diabetes, congenital erythropoietic porphyria, Congenital heart disease, congenital hypothyroidism, Connective tissue disease, Cowden syndrome, Cri du chat, Crohn's disease, fibrostenosing, Crouzon syndrome, Crouzonodermoskeletal syndrome, cystic fibrosis, De Grouchy Syndrome, Degenerative nerve diseases, Dent's disease, developmental disabilities, DiGeorge syndrome, Distal spinal muscular atrophy type V, Down syndrome, Dwarfism, Ehlers-Danlos syndrome, Ehlers-Danlos syndrome arthrochalasia type, Ehlers-Danlos syndrome classical type, Ehlers-Danlos syndrome dermatosparaxis type, Ehlers-Danlos syndrome kyphoscoliosis type, vascular type, erythropoietic protoporphyria, Fabry's disease, Facial injuries and disorders, factor V Leiden thrombophilia, familial adenomatous polyposis, familial dysautonomia, fanconi anemia, FG syndrome, fragile X syndrome, Friedreich ataxia, Friedreich's ataxia, G6PD deficiency, galactosemia, Gaucher's disease (type 1, 2, and 3), Genetic brain disorders, Glycine encephalopathy, Haemochromatosis type 2, Haemochromatosis type 4, Harlequin Ichthyosis, Head and brain malformations, Hearing disorders and deafness, Hearing problems in children, hemochromatosis (neonatal, type 2 and type 3), hemophilia, hepatoerythropoietic porphyria, hereditary coproporphyria, Hereditary Multiple Exostoses, hereditary neuropathy with liability to pressure palsies, hereditary nonpolyposis colorectal cancer, homocystinuria, Huntington's disease, Hutchinson Gilford Progeria Syndrome, hyperoxaluria, primary, hyperphenylalaninemia, hypochondrogenesis, hypochondroplasia, idic15, incontinentia pigmenti, Infantile Gaucher disease, infantile-onset ascending hereditary spastic paralysis, Infertility, Jackson-Weiss syndrome, Joubert syndrome, Juvenile Primary Lateral Sclerosis, Kennedy disease, Klinefelter syndrome, Kniest dysplasia, Krabbe disease, Learning disability, Lesch-Nyhan syndrome, Leukodystrophies, Li-Fraumeni syndrome, lipoprotein lipase deficiency, familial, Male genital disorders, Marfan syndrome, McCune-Albright syndrome, McLeod syndrome, Mediterranean fever, familial, MEDNIK, Menkes disease, Menkes syndrome, Metabolic disorders, methemoglobinemia beta-globin type, Methemoglobinemia congenital methaemoglobinaemia, methylmalonic acidemia, Micro syndrome, Microcephaly, Movement disorders, Mowat-Wilson syndrome, Mucopolysaccharidosis (MPS I), Muenke syndrome, Muscular dystrophy, Muscular dystrophy, Duchenne and Becker type, muscular dystrophy, Duchenne and Becker types, myotonic dystrophy, Myotonic dystrophy type 1 and type 2, Neonatal hemochromatosis, neurofibromatosis, neurofibromatosis 1, neurofibromatosis 2, Neurofibromatosis type I, neurofibromatosis type II, Neurologic diseases, Neuromuscular disorders, Niemann-Pick disease, Nonketotic hyperglycinemia, nonsyndromic deafness, Nonsyndromic deafness autosomal recessive, Noonan syndrome, osteogenesis imperfecta (type I and type III), otospondylomegaepiphyseal dysplasia, pantothenate kinase-associated neurodegeneration, Patau Syndrome (Trisomy 13), Pendred syndrome, Peutz-Jeghers syndrome, Pfeiffer syndrome, phenylketonuria, porphyria, porphyria cutanea tarda, Prader-Willi syndrome, primary pulmonary hypertension, prion disease, Progeria, propionic acidemia, protein C deficiency, protein S deficiency, pseudo-Gaucher disease, pseudoxanthoma elasticum, Retinal disorders, retinoblastoma, retinoblastoma FA—Friedreich ataxia, Rett syndrome, Rubinstein-Taybi syndrome, SADDAN, Sandhoff disease, sensory and autonomic neuropathy type III, sickle cell anemia, skeletal muscle regeneration, Skin pigmentation disorders, Smith Lemli Opitz Syndrome, Speech and communication disorders, spinal muscular atrophy, spinal-bulbar muscular atrophy, spinocerebellar ataxia, spondyloepimetaphyseal dysplasia, Strudwick type, spondyloepiphyseal dysplasia congenita, Stickler syndrome, Stickler syndrome COL2A1, Tay-Sachs disease, tetrahydrobiopterin deficiency, thanatophoric dysplasia, thiamine-responsive megaloblastic anemia with diabetes mellitus and sensorineural deafness, Thyroid disease, Tourette's Syndrome, Treacher Collins syndrome, triple X syndrome, tuberous sclerosis, Turner syndrome, Usher syndrome, variegate porphyria, von Hippel-Lindau disease, Waardenburg syndrome, Weissenbacher-Zweymüller syndrome, Wilson disease, Wolf-Hirschhorn syndrome, Xeroderma Pigmentosum, X-linked severe combined immunodeficiency, X-linked sideroblastic anemia, and X-linked spinal-bulbar muscle atrophy.

In one embodiment, the subject methods and algorithm are used to diagnose, characterize, and monitor thyroid cancer. Other types of cancer that can be diagnosed, characterized and/or monitored using the algorithms and methods of the present invention include but are not limited to adrenal cortical cancer, anal cancer, aplastic anemia, bile duct cancer, bladder cancer, bone cancer, bone metastasis, central nervous system (CNS) cancers, peripheral nervous system (PNS) cancers, breast cancer, Castleman's disease, cervical cancer, childhood Non-Hodgkin's lymphoma, colon and rectum cancer, endometrial cancer, esophagus cancer, Ewing's family of tumors (e.g. Ewing's sarcoma), eye cancer, gallbladder cancer, gastrointestinal carcinoid tumors, gastrointestinal stromal tumors, gestational trophoblastic disease, hairy cell leukemia, Hodgkin's disease, Kaposi's sarcoma, kidney cancer, laryngeal and hypopharyngeal cancer, acute lymphocytic leukemia, acute myeloid leukemia, children's leukemia, chronic lymphocytic leukemia, chronic myeloid leukemia, liver cancer, lung cancer, lung carcinoid tumors, Non-Hodgkin's lymphoma, male breast cancer, malignant mesothelioma, multiple myeloma, myelodysplastic syndrome, myeloproliferative disorders, nasal cavity and paranasal cancer, nasopharyngeal cancer, neuroblastoma, oral cavity and oropharyngeal cancer, osteosarcoma, ovarian cancer, pancreatic cancer, penile cancer, pituitary tumor, prostate cancer, retinoblastoma, rhabdomyosarcoma, salivary gland cancer, sarcoma (adult soft tissue cancer), melanoma skin cancer, non-melanoma skin cancer, stomach cancer, testicular cancer, thymus cancer, uterine cancer (e.g. uterine sarcoma), vaginal cancer, vulvar cancer, and Waldenstrom's macroglobulinemia.

In some embodiments, gene expression product markers of the present invention may provide increased accuracy of genetic disorder or cancer diagnosis through the use of multiple gene expression product markers in low quantity and quality, and statistical analysis using the algorithms of the present invention. In particular, the present invention provides, but is not limited to, methods of diagnosing, characterizing and classifying gene expression profiles associated with thyroid cancers. The present invention also provides algorithms for characterizing and classifying thyroid tissue samples, and kits and compositions useful for the application of said methods. The disclosure further includes methods for running a molecular profiling business.

In one embodiment of the invention, markers and genes can be identified to have differential expression in thyroid cancer samples compared to thyroid benign samples. Illustrative examples having a benign pathology include follicular adenoma, Hurthle cell adenoma, lymphocytic thyroiditis, and nodular hyperplasia. Illustrative examples having a malignant pathology include follicular carcinoma, follicular variant of papillary thyroid carcinoma, medullary carcinoma, and papillary thyroid carcinoma.

Biological samples may be treated to extract nucleic acid such as DNA or RNA. The nucleic acid may be contacted with an array of probes of the present invention under conditions to allow hybridization. The degree of hybridization may be assayed in a quantitative manner using a number of methods known in the art. In some cases, the degree of hybridization at a probe position may be related to the intensity of signal provided by the assay, which therefore is related to the amount of complementary nucleic acid sequence present in the sample. Software can be used to extract, normalize, summarize, and analyze array intensity data from probes across the human genome or transcriptome including expressed genes, exons, introns, and miRNAs. In some embodiments, the intensity of a given probe in either the benign or malignant samples can be compared against a reference set to determine whether differential expression is occurring in a sample. An increase or decrease in relative intensity at a marker position on an array corresponding to an expressed sequence is indicative of an increase or decrease, respectively, of expression of the corresponding expressed sequence. Alternatively, a decrease in relative intensity may be indicative of a mutation in the expressed sequence.
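Comparing a probe's intensity against a reference set can be illustrated with a log2 ratio, as in the sketch below. The use of the reference mean, the log2 scale, and the cutoff of one log2 unit (a 2 fold change) are assumptions chosen for illustration and are not specified in the text above; the intensity values are invented.

```python
import math


def log2_ratio(sample_intensity, reference_intensities):
    """log2 of the sample intensity relative to the mean reference intensity."""
    reference_mean = sum(reference_intensities) / len(reference_intensities)
    return math.log2(sample_intensity / reference_mean)


def call_expression(sample_intensity, reference_intensities, min_abs_log2=1.0):
    """Label a probe as increased, decreased or unchanged versus the reference."""
    ratio = log2_ratio(sample_intensity, reference_intensities)
    if ratio >= min_abs_log2:
        return "increased"
    if ratio <= -min_abs_log2:
        return "decreased"
    return "unchanged"


# Hypothetical normalized intensities for one probe in a reference set.
reference = [180.0, 200.0, 220.0]
calls = [call_expression(v, reference) for v in (800.0, 50.0, 210.0)]
```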

The resulting intensity values for each sample can be analyzed using feature selection techniques including filter techniques which assess the relevance of features by looking at the intrinsic properties of the data, wrapper methods which embed the model hypothesis within a feature subset search, and embedded techniques in which the search for an optimal set of features is built into a classifier algorithm.

Filter techniques useful in the methods of the present invention include (1) parametric methods such as the use of two sample t-tests, ANOVA analyses, Bayesian frameworks, and Gamma distribution models, (2) model free methods such as the use of Wilcoxon rank sum tests, between-within class sum of squares tests, rank products methods, random permutation methods, or TNoM which involves setting a threshold point for fold-change differences in expression between two datasets and then detecting the threshold point in each gene that minimizes the number of misclassifications, and (3) multivariate methods such as bivariate methods, correlation based feature selection methods (CFS), minimum redundancy maximum relevance methods (MRMR), Markov blanket filter methods, and uncorrelated shrunken centroid methods. Wrapper methods useful in the methods of the present invention include sequential search methods, genetic algorithms, and estimation of distribution algorithms. Embedded methods useful in the methods of the present invention include random forest algorithms, weight vector of support vector machine algorithms, and weights of logistic regression algorithms. Bioinformatics. 2007 Oct. 1; 23(19):2507-17 provides an overview of the relative merits of the filter techniques provided above for the analysis of intensity data.
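As one illustration of a model free filter, the sketch below scores each gene by the smallest number of misclassifications that a single expression threshold can achieve, in the spirit of the TNoM description above, and ranks genes by that score. The function names, gene labels, and expression values are hypothetical, and this is a simplified reading rather than a reference implementation of any published TNoM procedure.

```python
def tnom_score(values, labels, positive="M"):
    """Fewest misclassifications achievable by one threshold on one gene."""
    n = len(values)
    best = n
    # Candidate thresholds: one below every value, then each observed value.
    for threshold in [float("-inf")] + sorted(set(values)):
        # Rule A calls a sample positive when its value exceeds the threshold.
        errors_a = sum((v > threshold) != (y == positive)
                       for v, y in zip(values, labels))
        # Rule B is the mirror image (positive when at or below the
        # threshold), so its error count is the complement.
        errors_b = n - errors_a
        best = min(best, errors_a, errors_b)
    return best


def rank_genes_by_tnom(expression, labels, positive="M"):
    """Return (gene, score) pairs sorted from most to least discriminating."""
    scored = [(gene, tnom_score(vals, labels, positive))
              for gene, vals in expression.items()]
    return sorted(scored, key=lambda item: item[1])


# Invented expression values for two hypothetical genes across six samples.
labels = ["B", "B", "B", "M", "M", "M"]
expression = {
    "GENE_B": [1.0, 10.0, 2.0, 11.0, 3.0, 12.0],
    "GENE_A": [1.0, 2.0, 3.0, 10.0, 11.0, 12.0],
}
ranking = rank_genes_by_tnom(expression, labels)
```

A score of 0 means some threshold separates the two classes perfectly on that gene; higher scores indicate weaker single-gene discrimination.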

Selected features may then be classified using a classifier algorithm. Illustrative algorithms include but are not limited to methods that reduce the number of variables such as principal component analysis algorithms, partial least squares methods, and independent component analysis algorithms. Illustrative algorithms further include but are not limited to methods that handle large numbers of variables directly such as statistical methods and methods based on machine learning techniques. Statistical methods include penalized logistic regression, prediction analysis of microarrays (PAM), methods based on shrunken centroids, support vector machine analysis, and regularized linear discriminant analysis. Machine learning techniques include bagging procedures, boosting procedures, random forest algorithms, and combinations thereof. Cancer Inform. 2008; 6: 77-97 provides an overview of the classification techniques provided above for the analysis of microarray intensity data.
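Of the statistical methods listed above, penalized logistic regression is simple enough to sketch without external libraries. The version below uses an L2 penalty and plain batch gradient descent; the penalty strength, learning rate, epoch count, and toy data are all assumptions for illustration, and a production analysis would normally rely on an established statistics package instead.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_penalized_logistic(xs, ys, l2=0.1, lr=0.1, epochs=2000):
    """Fit weights and bias by gradient descent on L2-penalized log loss."""
    n, d = len(xs), len(xs[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        for j in range(d):
            # The l2 * w[j] term shrinks weights toward zero (the penalty).
            w[j] -= lr * (grad_w[j] / n + l2 * w[j])
        b -= lr * grad_b / n  # the bias is left unpenalized
    return w, b


def predict_proba(w, b, x):
    """Estimated probability that sample x belongs to class 1."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Toy, centered expression features: class 0 = benign, class 1 = malignant.
xs = [[-2.0, 0.1], [-1.5, -0.2], [-1.0, 0.0],
      [1.0, 0.1], [1.5, -0.1], [2.0, 0.2]]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_penalized_logistic(xs, ys)
```

The penalty keeps the weights finite even though the toy classes are perfectly separable, which is one reason penalized variants are preferred when there are many genes and few samples.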

The markers and genes of the present invention can be utilized to characterize the cancerous or non-cancerous status of cells or tissues. The present invention includes a method for diagnosing benign tissues or cells from malignant tissues or cells comprising determining the differential expression of a marker or gene in a thyroid sample of a subject wherein said marker or gene is a marker or gene listed in FIG. 2-6, 9-13, 16 or 17. The present invention also includes methods for diagnosing medullary thyroid carcinoma comprising determining the differential expression of a marker or gene in a thyroid sample of a subject wherein said marker or gene is a marker or gene listed in FIG. 10. The present invention also includes methods for diagnosing thyroid pathology subtypes comprising determining the differential expression of a marker or gene in a thyroid sample of a subject wherein said marker or gene is a marker or gene listed in FIG. 13. The present invention also includes methods for diagnosing benign tissues or cells from malignant tissues or cells comprising determining the differential expression of an miRNA in a thyroid sample of a subject wherein said miRNA is an miRNA listed in FIG. 16 or 17.

In accordance with the foregoing, the differential expression of a gene, genes, markers, miRNAs, or a combination thereof as disclosed herein may be determined using northern blotting and employing the sequences as identified herein to develop probes for this purpose. Such probes may be composed of DNA or RNA or synthetic nucleotides or a combination of the above and may advantageously be comprised of a contiguous stretch of nucleotide residues matching, or complementary to, a sequence as identified in FIG. 2-6, 9-13, 16 or 17. Such probes will most usefully comprise a contiguous stretch of at least 15-200 residues or more including 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 175, or 200 nucleotides or more, derived from one or more of the sequences as identified in FIG. 2-6, 9-13, 16 or 17. Thus, where a single probe binds multiple times to the transcriptome of a sample of cells that are cancerous, or are suspected of being cancerous, or predisposed to become cancerous, whereas binding of the same probe to a similar amount of transcriptome derived from the genome of otherwise non-cancerous cells of the same organ or tissue results in observably more or less binding, this is indicative of differential expression of a gene, multiple genes, markers, or miRNAs comprising, or corresponding to, the sequences identified in FIG. 2-6, 9-13, 16 or 17 from which the probe sequence was derived.
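To make the notion of a probe as a contiguous stretch matching, or complementary to, a target sequence concrete, the sketch below tiles a DNA sequence into fixed-length windows and optionally returns their reverse complements. The helper names and the 17-nucleotide target are invented for illustration and are not sequences from the figures; the minimum length of 15 follows the range given above.

```python
# Translation table for DNA base complements (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def candidate_probes(target, length=15, complementary=True):
    """Tile a target into contiguous windows usable as probe candidates."""
    if length < 15:
        raise ValueError("probe length below the 15-residue minimum")
    windows = [target[i:i + length]
               for i in range(len(target) - length + 1)]
    if complementary:
        # A complementary probe hybridizes to the target strand itself.
        return [reverse_complement(w) for w in windows]
    return windows


# Hypothetical 17-nucleotide target sequence.
target = "ATGCATGCATGCATGCA"
matching = candidate_probes(target, complementary=False)
complements = candidate_probes(target)
```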

In one such embodiment, the elevated expression, as compared to normal cells and/or tissues of the same organ, is determined by measuring the relative rates of transcription of RNA, such as by production of corresponding cDNAs and then analyzing the resulting DNA using probes developed from the gene sequences as identified in FIG. 2-6, 9-13, 16 or 17. Thus, use of reverse transcriptase with the full RNA complement of a cell suspected of being cancerous produces a corresponding amount of cDNA that can then be amplified using polymerase chain reaction, or some other means, such as linear amplification, isothermal amplification, NASBA, or rolling circle amplification, to determine the relative levels of resulting cDNA and, thereby, the relative levels of gene expression.

Increased expression may also be determined using agents that selectively bind to, and thereby detect, the presence of expression products of the genes disclosed herein. For example, an antibody, possibly a suitably labeled antibody, such as where the antibody is bound to a fluorescent or radiolabel, may be generated against one of the polypeptides comprising a sequence as identified in FIGS. 2-6, and 9-13, and said antibody will then react with, binding either selectively or specifically, to a polypeptide encoded by one of the genes that corresponds to a sequence disclosed herein. Such antibody binding, especially relative extent of such binding in samples derived from suspected cancerous, as opposed to otherwise non-cancerous, cells and tissues, can then be used as a measure of the extent of expression, or over-expression, of the cancer-related genes identified herein. Thus, the genes identified herein as being over-expressed in cancerous cells and tissues may be over-expressed due to increased copy number, or due to over-transcription, such as where the over-expression is due to over-production of a transcription factor that activates the gene and leads to repeated binding of RNA polymerase, thereby generating larger than normal amounts of RNA transcripts, which are subsequently translated into polypeptides, such as the polypeptides comprising amino acid sequences as identified in FIGS. 2-6, and 9-13. Such analysis provides an additional means of ascertaining the expression of the genes identified according to the invention and thereby determining the presence of a cancerous state in a sample derived from a patient to be tested, or the predisposition to develop cancer at a subsequent time in said patient.

In employing the methods of the invention, it should be borne in mind that gene or marker expression indicative of a cancerous state need not be characteristic of every cell found to be cancerous. Thus, the methods disclosed herein are useful for detecting the presence of a cancerous condition within a tissue where less than all cells exhibit the complete pattern of over-expression. For example, a set of selected genes or markers, comprising sequences homologous under stringent conditions, or at least 90%, preferably 95%, identical to at least one of the sequences as identified in FIGS. 2-6, 9-13, 16 or 17, may be found, using appropriate probes, either DNA or RNA, to be present in as few as 60% of cells derived from a sample of tumorous, or malignant, tissue while being absent from as many as 60% of cells derived from corresponding non-cancerous, or otherwise normal, tissue (and thus being present in as many as 40% of such normal tissue cells). In one embodiment, such expression pattern is found to be present in at least 70% of cells drawn from a cancerous tissue and absent from at least 70% of a corresponding normal, non-cancerous, tissue sample. In another embodiment, such expression pattern is found to be present in at least 80% of cells drawn from a cancerous tissue and absent from at least 80% of a corresponding normal, non-cancerous, tissue sample. In another embodiment, such expression pattern is found to be present in at least 90% of cells drawn from a cancerous tissue and absent from at least 90% of a corresponding normal, non-cancerous, tissue sample. In another embodiment, such expression pattern is found to be present in 100% of cells drawn from a cancerous tissue and absent from 100% of a corresponding normal, non-cancerous, tissue sample, although the latter embodiment may represent a rare occurrence.
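
The tiered percentages above can be read as a paired threshold test: the pattern must be present in at least a given share of tumor cells and absent from at least the same share of normal cells. The following is a minimal sketch of that reading, assuming per-cell presence/absence counts are available; the function name and the integer-percent representation are illustrative choices, not part of the disclosure.

```python
def highest_tier_met(present_in_tumor, tumor_cells,
                     absent_in_normal, normal_cells,
                     tiers=(60, 70, 80, 90, 100)):
    """Return the highest percentage tier satisfied by both tissues, or None.

    Counts are numbers of cells; tiers are whole percentages taken from the
    embodiments described in the text.
    """
    if tumor_cells <= 0 or normal_cells <= 0:
        raise ValueError("cell totals must be positive")
    best = None
    for tier in sorted(tiers):
        # Integer arithmetic avoids floating-point edge cases at the cutoffs.
        tumor_ok = present_in_tumor * 100 >= tier * tumor_cells
        normal_ok = absent_in_normal * 100 >= tier * normal_cells
        if tumor_ok and normal_ok:
            best = tier
    return best
```

For example, a pattern present in 85 of 100 tumor cells and absent from 92 of 100 normal cells satisfies the 80% tier but not the 90% tier.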

In some embodiments, molecular profiling includes detection, analysis, or quantification of nucleic acid (DNA or RNA), protein, or a combination thereof. The diseases or conditions to be diagnosed by the methods of the present invention include, for example, conditions of abnormal growth in one or more tissues of a subject including but not limited to skin, heart, lung, kidney, breast, pancreas, liver, muscle, smooth muscle, bladder, gall bladder, colon, intestine, brain, esophagus, or prostate. In some embodiments, the tissues analyzed by the methods of the present invention include thyroid tissues.

In some embodiments, the diseases or conditions diagnosed by the methods of the present invention include benign and malignant hyperproliferative disorders including but not limited to cancers, hyperplasias, or neoplasias. In some cases, the hyperproliferative disorders diagnosed by the methods of the present invention include but are not limited to breast cancer such as a ductal carcinoma in duct tissue in a mammary gland, medullary carcinomas, colloid carcinomas, tubular carcinomas, and inflammatory breast cancer; ovarian cancer, including epithelial ovarian tumors such as adenocarcinoma in the ovary and an adenocarcinoma that has migrated from the ovary into the abdominal cavity; uterine cancer; cervical cancer such as adenocarcinoma in the cervical epithelium including squamous cell carcinoma and adenocarcinomas; prostate cancer, such as a prostate cancer selected from the following: an adenocarcinoma or an adenocarcinoma that has migrated to the bone; pancreatic cancer such as epithelioid carcinoma in the pancreatic duct tissue and an adenocarcinoma in a pancreatic duct; bladder cancer such as a transitional cell carcinoma in urinary bladder, urothelial carcinomas (transitional cell carcinomas), tumors in the urothelial cells that line the bladder, squamous cell carcinomas, adenocarcinomas, and small cell cancers; leukemia such as acute myeloid leukemia (AML), acute lymphocytic leukemia, chronic lymphocytic leukemia, chronic myeloid leukemia, hairy cell leukemia, myelodysplasia, myeloproliferative disorders, acute myelogenous leukemia (AML), chronic myelogenous leukemia (CML), mastocytosis, chronic lymphocytic leukemia (CLL), multiple myeloma (MM), and myelodysplastic syndrome (MDS); bone cancer; lung cancer such as non-small cell lung cancer (NSCLC), which is divided into squamous cell carcinomas, adenocarcinomas, and large cell undifferentiated carcinomas, and small cell lung cancer; skin cancer such as basal cell carcinoma, melanoma, squamous cell carcinoma and actinic keratosis, which is a skin condition that sometimes develops into squamous cell carcinoma; eye retinoblastoma; cutaneous or intraocular (eye) melanoma; primary liver cancer (cancer that begins in the liver); kidney cancer; AIDS-related lymphoma such as diffuse large B-cell lymphoma, B-cell immunoblastic lymphoma and small non-cleaved cell lymphoma; Kaposi's Sarcoma; viral-induced cancers including hepatitis B virus (HBV), hepatitis C virus (HCV), and hepatocellular carcinoma; human lymphotropic virus-type 1 (HTLV-1) and adult T-cell leukemia/lymphoma; and human papilloma virus (HPV) and cervical cancer; central nervous system (CNS) cancers such as primary brain tumor, which includes gliomas (astrocytoma, anaplastic astrocytoma, or glioblastoma multiforme), Oligodendroglioma, Ependymoma, Meningioma, Lymphoma, Schwannoma, and Medulloblastoma; peripheral nervous system (PNS) cancers such as acoustic neuromas and malignant peripheral nerve sheath tumor (MPNST) including neurofibromas and schwannomas, malignant fibrous cytoma, malignant fibrous histiocytoma, malignant meningioma, malignant mesothelioma, and malignant mixed Müllerian tumor; oral cavity and oropharyngeal cancer such as hypopharyngeal cancer, laryngeal cancer, nasopharyngeal cancer, and oropharyngeal cancer; stomach cancer such as lymphomas, gastric stromal tumors, and carcinoid tumors; testicular cancer such as germ cell tumors (GCTs), which include seminomas and nonseminomas, and gonadal stromal tumors, which include Leydig cell tumors and Sertoli cell tumors; thymus cancer such as thymomas, thymic carcinomas, Hodgkin disease, non-Hodgkin lymphomas, carcinoids or carcinoid tumors; rectal cancer; and colon cancer. In some cases, the diseases or conditions diagnosed by the methods of the present invention include but are not limited to thyroid disorders such as, for example, benign thyroid disorders including but not limited to follicular adenomas, Hurthle cell adenomas, lymphocytic thyroiditis, and thyroid hyperplasia. In some cases, the diseases or conditions diagnosed by the methods of the present invention include but are not limited to malignant thyroid disorders such as, for example, follicular carcinomas, follicular variant of papillary thyroid carcinomas, medullary carcinomas, and papillary carcinomas. In some cases, the methods of the present invention provide for a diagnosis of a tissue as diseased or normal. In other cases, the methods of the present invention provide for a diagnosis of normal, benign, or malignant. In some cases, the methods of the present invention provide for a diagnosis of benign/normal, or malignant. In some cases, the methods of the present invention provide for a diagnosis of one or more of the specific diseases or conditions provided herein.

II. Obtaining a Biological Sample

In some embodiments, the methods of the present invention provide forobtaining a sample from a subject. As used herein, the term subjectrefers to any animal (e.g. a mammal), including but not limited tohumans, non-human primates, rodents, dogs, pigs, and the like. Themethods of obtaining provided herein include methods of biopsy includingfine needle aspiration, core needle biopsy, vacuum assisted biopsy,incisional biopsy, excisional biopsy, punch biopsy, shave biopsy or skinbiopsy. The sample may be obtained from any of the tissues providedherein including but not limited to skin, heart, lung, kidney, breast,pancreas, liver, muscle, smooth muscle, bladder, gall bladder, colon,intestine, brain, prostate, esophagus, or thyroid. Alternatively, thesample may be obtained from any other source including but not limitedto blood, sweat, hair follicle, buccal tissue, tears, menses, feces, orsaliva. In some embodiments of the present invention, a medicalprofessional may obtain a biological sample for testing. In some casesthe medical professional may refer the subject to a testing center orlaboratory for submission of the biological sample. In other cases, thesubject may provide the sample. In some cases, a molecular profilingbusiness of the present invention may obtain the sample.

The sample may be obtained by methods known in the art such as thebiopsy methods provided herein, swabbing, scraping, phlebotomy, or anyother methods known in the art. In some cases, the sample may beobtained, stored, or transported using components of a kit of thepresent invention. In some cases, multiple samples, such as multiplethyroid samples may be obtained for diagnosis by the methods of thepresent invention. In some cases, multiple samples, such as one or moresamples from one tissue type (e.g. thyroid) and one or more samples fromanother tissue (e.g. buccal) may be obtained for diagnosis by themethods of the present invention. In some cases, multiple samples suchas one or more samples from one tissue type (e.g. thyroid) and one ormore samples from another tissue (e.g. buccal) may be obtained at thesame or different times. In some cases, the samples obtained atdifferent times are stored and/or analyzed by different methods. Forexample, a sample may be obtained and analyzed by cytological analysis(routine staining). In some cases, further sample may be obtained from asubject based on the results of a cytological analysis. The diagnosis ofcancer may include an examination of a subject by a physician, nurse orother medical professional. The examination may be part of a routineexamination, or the examination may be due to a specific complaintincluding but not limited to one of the following: pain, illness,anticipation of illness, presence of a suspicious lump or mass, adisease, or a condition. The subject may or may not be aware of thedisease or condition. The medical professional may obtain a biologicalsample for testing. In some cases the medical professional may refer thesubject to a testing center or laboratory for submission of thebiological sample.

In some cases, the subject may be referred to a specialist such as an oncologist, surgeon, or endocrinologist for further diagnosis. The specialist may likewise obtain a biological sample for testing or refer the individual to a testing center or laboratory for submission of the biological sample. In any case, the biological sample may be obtained by a physician, nurse, or other medical professional such as a medical technician, endocrinologist, cytologist, phlebotomist, radiologist, or a pulmonologist. The medical professional may indicate the appropriate test or assay to perform on the sample, or the molecular profiling business of the present disclosure may consult on which assays or tests are most appropriately indicated. The molecular profiling business may bill the individual or medical or insurance provider thereof for consulting work, for sample acquisition and/or storage, for materials, or for all products and services rendered.

In some embodiments of the present invention, a medical professionalneed not be involved in the initial diagnosis or sample acquisition. Anindividual may alternatively obtain a sample through the use of an overthe counter kit. Said kit may contain a means for obtaining said sampleas described herein, a means for storing said sample for inspection, andinstructions for proper use of the kit. In some cases, molecularprofiling services are included in the price for purchase of the kit. Inother cases, the molecular profiling services are billed separately.

A sample suitable for use by the molecular profiling business may be anymaterial containing tissues, cells, nucleic acids, genes, genefragments, expression products, gene expression products, or geneexpression product fragments of an individual to be tested. Methods fordetermining sample suitability and/or adequacy are provided. A samplemay include but is not limited to, tissue, cells, or biological materialfrom cells or derived from cells of an individual. The sample may be aheterogeneous or homogeneous population of cells or tissues. Thebiological sample may be obtained using any method known to the art thatcan provide a sample suitable for the analytical methods describedherein.

The sample may be obtained by non-invasive methods including but notlimited to: scraping of the skin or cervix, swabbing of the cheek,saliva collection, urine collection, feces collection, collection ofmenses, tears, or semen. In other cases, the sample is obtained by aninvasive procedure including but not limited to: biopsy, alveolar orpulmonary lavage, needle aspiration, or phlebotomy. The method of biopsymay further include incisional biopsy, excisional biopsy, punch biopsy,shave biopsy, or skin biopsy. The method of needle aspiration mayfurther include fine needle aspiration, core needle biopsy, vacuumassisted biopsy, or large core biopsy. In some embodiments, multiplesamples may be obtained by the methods herein to ensure a sufficientamount of biological material. Methods of obtaining suitable samples ofthyroid are known in the art and are further described in the ATAGuidelines for thyroid nodule management (Cooper et al. Thyroid Vol. 16No. 2 2006), herein incorporated by reference in its entirety. Genericmethods for obtaining biological samples are also known in the art andfurther described in for example Ramzy, Ibrahim Clinical Cytopathologyand Aspiration Biopsy 2001 which is herein incorporated by reference inits entirety. In one embodiment, the sample is a fine needle aspirate ofa thyroid nodule or a suspected thyroid tumor. In some cases, the fineneedle aspirate sampling procedure may be guided by the use of anultrasound, X-ray, or other imaging device.

In some embodiments of the present invention, the molecular profilingbusiness may obtain the biological sample from a subject directly, froma medical professional, from a third party, or from a kit provided bythe molecular profiling business or a third party. In some cases, thebiological sample may be obtained by the molecular profiling businessafter the subject, a medical professional, or a third party acquires andsends the biological sample to the molecular profiling business. In somecases, the molecular profiling business may provide suitable containers,and excipients for storage and transport of the biological sample to themolecular profiling business.

III. Storing the Sample

In some embodiments, the methods of the present invention provide forstoring the sample for a time such as seconds, minutes, hours, days,weeks, months, years or longer after the sample is obtained and beforethe sample is analyzed by one or more methods of the invention. In somecases, the sample obtained from a subject is subdivided prior to thestep of storage or further analysis such that different portions of thesample are subject to different downstream methods or processesincluding but not limited to storage, cytological analysis, adequacytests, nucleic acid extraction, molecular profiling or a combinationthereof.

In some cases, a portion of the sample may be stored while anotherportion of said sample is further manipulated. Such manipulations mayinclude but are not limited to molecular profiling; cytologicalstaining; nucleic acid (RNA or DNA) extraction, detection, orquantification; gene expression product (RNA or Protein) extraction,detection, or quantification; fixation; and examination. The sample maybe fixed prior to or during storage by any method known to the art suchas using glutaraldehyde, formaldehyde, or methanol. In other cases, thesample is obtained and stored and subdivided after the step of storagefor further analysis such that different portions of the sample aresubject to different downstream methods or processes including but notlimited to storage, cytological analysis, adequacy tests, nucleic acidextraction, molecular profiling or a combination thereof. In some cases,samples are obtained and analyzed by for example cytological analysis,and the resulting sample material is further analyzed by one or moremolecular profiling methods of the present invention. In such cases, thesamples may be stored between the steps of cytological analysis and thesteps of molecular profiling. Samples may be stored upon acquisition tofacilitate transport, or to wait for the results of other analyses. Inanother embodiment, samples may be stored while awaiting instructionsfrom a physician or other medical professional.

The acquired sample may be placed in a suitable medium, excipient, solution, or container for short term or long term storage. Said storage may require keeping the sample in a refrigerated or frozen environment. The sample may be quickly frozen prior to storage in a frozen environment. The frozen sample may be contacted with a suitable cryopreservation medium or compound including but not limited to: glycerol, ethylene glycol, sucrose, or glucose. A suitable medium, excipient, or solution may include but is not limited to: Hanks' salt solution, saline, cellular growth medium, an ammonium salt solution such as ammonium sulphate or ammonium phosphate, or water. Suitable concentrations of ammonium salts include solutions of about 0.1 g/ml, 0.2 g/ml, 0.3 g/ml, 0.4 g/ml, 0.5 g/ml, 0.6 g/ml, 0.7 g/ml, 0.8 g/ml, 0.9 g/ml, 1.0 g/ml, 1.1 g/ml, 1.2 g/ml, 1.3 g/ml, 1.4 g/ml, 1.5 g/ml, 1.6 g/ml, 1.7 g/ml, 1.8 g/ml, 1.9 g/ml, 2.0 g/ml, 2.2 g/ml, 2.3 g/ml, 2.5 g/ml or higher. The medium, excipient, or solution may or may not be sterile.

The sample may be stored at room temperature or at reduced temperatures such as cold temperatures (e.g. between about 20° C. and about 0° C.), or freezing temperatures, including for example 0° C., −1° C., −2° C., −3° C., −4° C., −5° C., −6° C., −7° C., −8° C., −9° C., −10° C., −12° C., −14° C., −15° C., −16° C., −20° C., −22° C., −25° C., −28° C., −30° C., −35° C., −40° C., −45° C., −50° C., −60° C., −70° C., −80° C., −100° C., −120° C., −140° C., −180° C., −190° C., or about −200° C. In some cases, the samples may be stored in a refrigerator, on ice or a frozen gel pack, in a freezer, in a cryogenic freezer, on dry ice, in liquid nitrogen, or in a vapor phase equilibrated with liquid nitrogen.

The medium, excipient, or solution may contain preservative agents to maintain the sample in an adequate state for subsequent diagnostics or manipulation, or to prevent coagulation. Said preservatives may include citrate, ethylenediaminetetraacetic acid, sodium azide, or thimerosal. The medium, excipient or solution may contain suitable buffers or salts such as Tris buffers or phosphate buffers, sodium salts (e.g. NaCl), calcium salts, magnesium salts, and the like. In some cases, the sample may be stored in a commercial preparation suitable for storage of cells for subsequent cytological analysis such as but not limited to Cytyc ThinPrep, SurePath, or Monoprep.

The sample container may be any container suitable for storage and/or transport of the biological sample including but not limited to: a cup, a cup with a lid, a tube, a sterile tube, a vacuum tube, a syringe, a bottle, a microscope slide, or any other suitable container. The container may or may not be sterile.

IV. Transportation of the Sample

The methods of the present invention provide for transport of the sample. In some cases, the sample is transported from a clinic, hospital, doctor's office, or other location to a second location whereupon the sample may be stored and/or analyzed by, for example, cytological analysis or molecular profiling. In some cases, the sample may be transported to a molecular profiling company in order to perform the analyses described herein. In other cases, the sample may be transported to a laboratory that is authorized or otherwise capable of performing the methods of the present invention, such as a Clinical Laboratory Improvement Amendments (CLIA) laboratory. The sample may be transported by the individual from whom the sample derives. Said transportation by the individual may include the individual appearing at a molecular profiling business or a designated sample receiving point and providing a sample. Said providing of the sample may involve any of the techniques of sample acquisition described herein, or the sample may already have been acquired and stored in a suitable container as described herein. In other cases the sample may be transported to a molecular profiling business using a courier service, the postal service, a shipping service, or any method capable of transporting the sample in a suitable manner. In some cases, the sample may be provided to a molecular profiling business by a third party testing laboratory (e.g. a cytology lab). In other cases, the sample may be provided to a molecular profiling business by the subject's primary care physician, endocrinologist or other medical professional. The cost of transport may be billed to the individual, medical provider, or insurance provider. The molecular profiling business may begin analysis of the sample immediately upon receipt, or may store the sample in any manner described herein. The method of storage may or may not be the same as that chosen prior to receipt of the sample by the molecular profiling business.

The sample may be transported in any medium or excipient including anymedium or excipient provided herein suitable for storing the sample suchas a cryopreservation medium or a liquid based cytology preparation. Insome cases, the sample may be transported frozen or refrigerated such asat any of the suitable sample storage temperatures provided herein.

Upon receipt of the sample by the molecular profiling business, arepresentative or licensee thereof, a medical professional, researcher,or a third party laboratory or testing center (e.g. a cytologylaboratory) the sample may be assayed using a variety of routineanalyses known to the art such as cytological assays, and genomicanalysis. Such tests may be indicative of cancer, the type of cancer,any other disease or condition, the presence of disease markers, or theabsence of cancer, diseases, conditions, or disease markers. The testsmay take the form of cytological examination including microscopicexamination as described below. The tests may involve the use of one ormore cytological stains. The biological material may be manipulated orprepared for the test prior to administration of the test by anysuitable method known to the art for biological sample preparation. Thespecific assay performed may be determined by the molecular profilingcompany, the physician who ordered the test, or a third party such as aconsulting medical professional, cytology laboratory, the subject fromwhom the sample derives, or an insurance provider. The specific assaymay be chosen based on the likelihood of obtaining a definite diagnosis,the cost of the assay, the speed of the assay, or the suitability of theassay to the type of material provided.

V. Test for Adequacy

Subsequent to or during sample acquisition, including before or after astep of storing the sample, the biological material may be collected andassessed for adequacy, for example, to assess the suitability of thesample for use in the methods and compositions of the present invention.The assessment may be performed by the individual who obtains thesample, the molecular profiling business, the individual using a kit, ora third party such as a cytological lab, pathologist, endocrinologist,or a researcher. The sample may be determined to be adequate orinadequate for further analysis due to many factors including but notlimited to: insufficient cells, insufficient genetic material,insufficient protein, DNA, or RNA, inappropriate cells for the indicatedtest, or inappropriate material for the indicated test, age of thesample, manner in which the sample was obtained, or manner in which thesample was stored or transported. Adequacy may be determined using avariety of methods known in the art such as a cell staining procedure,measurement of the number of cells or amount of tissue, measurement oftotal protein, measurement of nucleic acid, visual examination,microscopic examination, or temperature or pH determination. In oneembodiment, sample adequacy will be determined from the results ofperforming a gene expression product level analysis experiment. Inanother embodiment sample adequacy will be determined by measuring thecontent of a marker of sample adequacy. Such markers include elementssuch as iodine, calcium, magnesium, phosphorous, carbon, nitrogen,sulfur, iron etc.; proteins such as but not limited to thyroglobulin;cellular mass; and cellular components such as protein, nucleic acid,lipid, or carbohydrate.

In some cases, iodine may be measured by a chemical method such asdescribed in U.S. Pat. No. 3,645,691 which is incorporated herein byreference in its entirety or other chemical methods known in the art formeasuring iodine content. Chemical methods for iodine measurementinclude but are not limited to methods based on the Sandell and Kolthoffreaction. Said reaction proceeds according to the following equation:

2Ce⁴⁺ + As³⁺ → 2Ce³⁺ + As⁵⁺ (catalyzed by I⁻)

Iodine has a catalytic effect upon the course of the reaction, i.e., the more iodine present in the preparation to be analyzed, the more rapidly the reaction proceeds. The speed of reaction is proportional to the iodine concentration. In some cases, this analytical method may be carried out in the following manner:

A predetermined amount of a solution of arsenous oxide As₂O₃ in concentrated sulfuric or nitric acid is added to the biological sample and the temperature of the mixture is adjusted to reaction temperature, i.e., usually to a temperature between 20° C. and 60° C. A predetermined amount of a cerium (IV) sulfate solution in sulfuric or nitric acid is added thereto. Thereupon, the mixture is allowed to react at the predetermined temperature for a definite period of time. Said reaction time is selected in accordance with the order of magnitude of the amount of iodine to be determined and with the respective selected reaction temperature. The reaction time is usually between about 1 minute and about 40 minutes. Thereafter, the cerium (IV) ion content of the test solution is determined photometrically. The lower the photometrically determined cerium (IV) ion concentration is, the higher is the speed of reaction and, consequently, the amount of catalytic agent, i.e., of iodine. In this manner the iodine of the sample can directly and quantitatively be determined.
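
The photometric read-out described above can be turned into an iodine estimate with a calibration line. The sketch below is one possible way to do so and is not taken from the referenced patent: it assumes that the cerium (IV) absorbance decays with pseudo-first-order kinetics, that the apparent rate is linear in iodine over the working range, and that iodine standards are run under the same time and temperature. All names and units are hypothetical.

```python
import math

def apparent_rate(abs_start, abs_end, minutes):
    """Apparent rate constant (1/min) from Ce(IV) absorbance before and after reaction."""
    if abs_start <= 0 or abs_end <= 0 or minutes <= 0:
        raise ValueError("absorbances and reaction time must be positive")
    # Less Ce(IV) remaining means a larger ratio and therefore a faster reaction.
    return math.log(abs_start / abs_end) / minutes

def fit_calibration(standards):
    """Least-squares line rate = slope * iodine + intercept.

    standards: list of (iodine_amount, rate) pairs from known standards.
    """
    n = len(standards)
    mean_x = sum(x for x, _ in standards) / n
    mean_y = sum(y for _, y in standards) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in standards)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in standards)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def iodine_from_rate(rate, slope, intercept):
    """Invert the calibration line to estimate iodine in the unknown sample."""
    return (rate - intercept) / slope
```

The intercept absorbs the slow uncatalyzed reaction, so a blank standard (zero iodine) is worth including when fitting the line.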

In other cases, iodine content of a sample of thyroid tissue may bemeasured by detecting a specific isotope of iodine such as for example¹²³I, ¹²⁴I, ¹²⁵I, and ¹³¹I. In still other cases, the marker may beanother radioisotope such as an isotope of carbon, nitrogen, sulfur,oxygen, iron, phosphorous, or hydrogen. The radioisotope in someinstances may be administered prior to sample collection. Methods ofradioisotope administration suitable for adequacy testing are well knownin the art and include injection into a vein or artery, or by ingestion.A suitable period of time between administration of the isotope andacquisition of thyroid nodule sample so as to effect absorption of aportion of the isotope into the thyroid tissue may include any period oftime between about a minute and a few days or about one week includingabout 1 minute, 2 minutes, 5 minutes, 10 minutes, 15 minutes, ½ an hour,an hour, 8 hours, 12 hours, 24 hours, 48 hours, 72 hours, or about one,one and a half, or two weeks, and may readily be determined by oneskilled in the art. Alternatively, samples may be measured for naturallevels of isotopes such as radioisotopes of iodine, calcium, magnesium,carbon, nitrogen, sulfur, oxygen, iron, phosphorous, or hydrogen.

(i) Cell and/or Tissue Content Adequacy Test

Methods for determining the amount of a tissue include but are not limited to weighing the sample or measuring the volume of sample. Methods for determining the amount of cells include but are not limited to counting cells, which may in some cases be performed after disaggregation with, for example, an enzyme such as trypsin or collagenase, or by physical means such as using a tissue homogenizer. Alternative methods for determining the amount of cells recovered include but are not limited to quantification of dyes that bind to cellular material, or measurement of the volume of cell pellet obtained following centrifugation. Methods for determining that an adequate number of a specific type of cell is present include PCR, Q-PCR, RT-PCR, immuno-histochemical analysis, cytological analysis, microscopic analysis, and/or visual analysis.

(ii) Nucleic Acid Content Adequacy Test

Samples may be analyzed by determining nucleic acid content after extraction from the biological sample using a variety of methods known to the art. In some cases, nucleic acids such as RNA or mRNA are separated from other nucleic acids prior to nucleic acid content analysis. Nucleic acid may be extracted, purified, and measured by ultraviolet absorbance, including but not limited to absorbance at 260 nanometers using a spectrophotometer. In other cases nucleic acid content or adequacy may be measured by fluorometer after contacting the sample with a stain. In still other cases, nucleic acid content or adequacy may be measured after electrophoresis, or using an instrument such as an Agilent Bioanalyzer, for example. It is understood that the methods of the present invention are not limited to a specific method for measuring nucleic acid content and/or integrity.

In some embodiments, the RNA quantity or yield from a given sample is measured shortly after purification using a NanoDrop spectrophotometer in a range of nano- to micrograms. In some embodiments, RNA quality is measured using an Agilent 2100 Bioanalyzer instrument, and is characterized by a calculated RNA Integrity Number (RIN, 1-10). The NanoDrop is a cuvette-free spectrophotometer. It uses 1 microliter to measure from 5 ng/μl to 3,000 ng/μl of sample. The key features of NanoDrop include low volume of sample and no cuvette; large dynamic range of 5 ng/μl to 3,000 ng/μl; and quantitation of DNA, RNA and proteins. NanoDrop™ 2000c allows for the analysis of 0.5 μl-2.0 μl samples, without the need for cuvettes or capillaries.
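
For orientation, absorbance-based RNA quantitation conventionally uses the approximation that an absorbance of 1.0 at 260 nm over a 10 mm path corresponds to about 40 ng/μl of RNA. The sketch below applies that standard conversion and checks the result against the 5 to 3,000 ng/μl range quoted above; the function names are hypothetical, and the input is assumed to be already normalized to a 10 mm path length.

```python
RNA_NG_PER_UL_PER_A260 = 40.0  # conventional factor for RNA, 10 mm path

def rna_concentration(a260_10mm, dilution_factor=1.0):
    """RNA concentration in ng/ul from a 10 mm-normalized A260 reading."""
    if a260_10mm < 0 or dilution_factor <= 0:
        raise ValueError("invalid absorbance or dilution factor")
    return a260_10mm * RNA_NG_PER_UL_PER_A260 * dilution_factor

def rna_yield_ng(concentration_ng_per_ul, elution_volume_ul):
    """Total RNA yield in nanograms for a given elution volume."""
    return concentration_ng_per_ul * elution_volume_ul

def within_quoted_range(concentration_ng_per_ul, low=5.0, high=3000.0):
    """True if the reading falls inside the dynamic range quoted in the text."""
    return low <= concentration_ng_per_ul <= high
```

A reading that falls below the lower bound would be flagged as outside the quoted range rather than reported as a reliable concentration.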

RNA quality can be measured by a calculated RNA Integrity Number (RIN). The RIN is produced by an algorithm for assigning integrity values to RNA measurements. The integrity of RNA is a major concern for gene expression studies and traditionally has been evaluated using the 28S to 18S rRNA ratio, a method that has been shown to be inconsistent. The RIN algorithm is applied to electrophoretic RNA measurements and is based on a combination of different features that contribute information about the RNA integrity to provide a more robust universal measure. In some embodiments, RNA quality is measured using an Agilent 2100 Bioanalyzer instrument. The protocols for measuring RNA quality are known and available commercially, for example, at the Agilent website. Briefly, in the first step, researchers deposit a total RNA sample into an RNA Nano LabChip. In the second step, the LabChip is inserted into the Agilent Bioanalyzer and the analysis is run, generating a digital electropherogram. In the third step, the RIN algorithm analyzes the entire electrophoretic trace of the RNA sample, including the presence or absence of degradation products, to determine sample integrity. The algorithm then assigns an RIN score from 1 to 10, where level 10 RNA is completely intact. Because interpretation of the electropherogram is automatic and not subject to individual interpretation, universal and unbiased comparison of samples is enabled and repeatability of experiments is improved. The RIN algorithm was developed using neural networks and adaptive learning in conjunction with a large database of eukaryote total RNA samples, which were obtained mainly from human, rat, and mouse tissues. Advantages of RIN include obtaining a numerical assessment of the integrity of RNA; directly comparing RNA samples, e.g. before and after archival, or comparing the integrity of the same tissue across different labs; and ensuring repeatability of experiments, e.g. if an RIN shows a given value and is suitable for microarray experiments, then an RIN of the same value can always be used for similar experiments given that the same organism/tissue/extraction method is used (Schroeder A, et al. BMC Molecular Biology 2006, 7:3).

In some embodiments, RNA quality is measured on a scale of RIN 1 to 10, 10 being highest quality. In one aspect, the present invention provides a method of analyzing gene expression from a sample with an RNA RIN value equal to or less than 6.0. In some embodiments, a sample containing RNA with an RIN number of 1.0, 2.0, 3.0, 4.0, 5.0 or 6.0 is analyzed for microarray gene expression using the subject methods and algorithms of the present invention. In some embodiments, the sample is a fine needle aspirate of thyroid tissue. The sample can be degraded with an RIN as low as 2.0.

Determination of gene expression in a given sample is a complex, dynamic, and expensive process. RNA samples with RIN≤5.0 are typically not used for multi-gene microarray analysis, and may instead be used only for single-gene RT-PCR and/or TaqMan assays. This dichotomy in the usefulness of RNA according to quality has thus far limited the usefulness of samples and hampered research efforts. The present invention provides methods via which low quality RNA can be used to obtain meaningful multi-gene expression results from samples containing low concentrations of RNA, for example, thyroid FNA samples.

In addition, samples having a low and/or un-measurable RNA concentration by NanoDrop, normally deemed inadequate for multi-gene expression profiling, can be measured and analyzed using the subject methods and algorithms of the present invention. The most sensitive and “state of the art” apparatus used to measure nucleic acid yield in the laboratory today is the NanoDrop spectrophotometer. As with many quantitative instruments of its kind, the accuracy of a NanoDrop measurement decreases significantly with very low RNA concentration. The minimum amount of RNA necessary for input into a microarray experiment also limits the usefulness of a given sample. In the present invention, the input from a sample containing a very low amount of nucleic acid can be estimated using a combination of the measurements from both the NanoDrop and the Bioanalyzer instruments, thereby optimizing the sample for multi-gene expression assays and analysis.
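The text does not specify how the NanoDrop and Bioanalyzer readings are combined, so the following Python sketch shows only one possible rule, under stated assumptions: the NanoDrop reading is trusted inside its quoted 5 ng/μl lower bound, and a Bioanalyzer concentration estimate (assumed to be available) is used otherwise. The function names, the fallback rule, and the use of RIN 5.0 as a flag are illustrative and are not the method of the invention.

```python
from typing import Optional

NANODROP_MIN_NG_PER_UL = 5.0   # lower end of the NanoDrop range quoted in the text
CONVENTIONAL_RIN_CUTOFF = 5.0  # samples at or below this RIN are typically excluded


def estimate_concentration(nanodrop: Optional[float],
                           bioanalyzer: Optional[float]) -> Optional[float]:
    """Hypothetical rule: use NanoDrop within its range, else fall back on Bioanalyzer."""
    if nanodrop is not None and nanodrop >= NANODROP_MIN_NG_PER_UL:
        return nanodrop
    return bioanalyzer


def summarize_sample(rin: float, nanodrop: Optional[float],
                     bioanalyzer: Optional[float], volume_ul: float) -> dict:
    """Collect the quantity and quality figures used to judge sample input."""
    conc = estimate_concentration(nanodrop, bioanalyzer)
    total_ng = None if conc is None else conc * volume_ul
    return {
        "concentration_ng_per_ul": conc,
        "total_rna_ng": total_ng,
        # Flag only; the text states that such samples can still be analyzed.
        "low_quality": rin <= CONVENTIONAL_RIN_CUTOFF,
    }
```

For example, a degraded sample with RIN 2.0, a NanoDrop reading of 1.2 ng/μl, and a Bioanalyzer estimate of 3.0 ng/μl in 10 μl would be summarized with an estimated input of 30 ng.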

(iii) Protein Content Adequacy Test

In some cases, protein content in the biological sample may be measured using a variety of methods known to the art, including but not limited to: ultraviolet absorbance at 280 nanometers, cell staining as described herein, or protein staining with for example Coomassie blue, or bicinchoninic acid. In some cases, protein is extracted from the biological sample prior to measurement of the sample. In some cases, multiple tests for adequacy of the sample may be performed in parallel, or one at a time. In some cases, the sample may be divided into aliquots for the purpose of performing multiple diagnostic tests prior to, during, or after assessing adequacy. In some cases, the adequacy test is performed on a small amount of the sample which may or may not be suitable for further diagnostic testing. In other cases, the entire sample is assessed for adequacy. In any case, the test for adequacy may be billed to the subject, medical provider, insurance provider, or government entity.

In some embodiments of the present invention, the sample may be tested for adequacy soon or immediately after collection. In some cases, when the sample adequacy test does not indicate a sufficient amount of sample or a sample of sufficient quality, additional samples may be taken.

VI. Analysis of Sample

In one aspect, the present invention provides methods for performing microarray gene expression analysis with low quantity and quality of polynucleotide, such as DNA or RNA. In some embodiments, the present disclosure describes methods of diagnosing, characterizing and/or monitoring a cancer by analyzing gene expression with low quantity and quality of RNA. In one embodiment, the cancer is thyroid cancer. Thyroid RNA can be obtained from fine needle aspirates (FNA). In some embodiments, a gene expression profile is obtained from degraded samples with an RNA RIN value of 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0 or less. In particular embodiments, a gene expression profile is obtained from a sample with an RIN equal to or less than 6, i.e. 6.0, 5.0, 4.0, 3.0, 2.0, 1.0 or less. Provided by the present invention are methods by which low quality RNA can be used to obtain meaningful gene expression results from samples containing low concentrations of nucleic acid, such as thyroid FNA samples.

Another estimate of sample usefulness is RNA yield, typically measured in nanogram to microgram amounts for gene expression assays. The most sensitive and “state of the art” apparatus used to measure nucleic acid yield in the laboratory today is the NanoDrop spectrophotometer. As with many quantitative instruments of its kind, the accuracy of a NanoDrop measurement decreases significantly with very low RNA concentration. The minimum amount of RNA necessary for input into a microarray experiment also limits the usefulness of a given sample. In some aspects, the present invention solves the low RNA concentration problem by estimating sample input using a combination of the measurements from both the NanoDrop and the Bioanalyzer instruments. Because the quality of data obtained from a gene expression study depends on RNA quantity, this estimate allows meaningful gene expression data to be generated from samples having a low or un-measurable RNA concentration as measured by NanoDrop.

The subject methods and algorithms enable: 1) gene expression analysis of samples containing low amount and/or low quality of nucleic acid; 2) a significant reduction of false positives and false negatives; 3) a determination of the underlying genetic, metabolic, or signaling pathways responsible for the resulting pathology; 4) the ability to assign a statistical probability to the accuracy of the diagnosis of genetic disorders; 5) the ability to resolve ambiguous results; and 6) the ability to distinguish between sub-types of cancer.

Cytological Analysis

Samples may be analyzed by cell staining combined with microscopic examination of the cells in the biological sample. Cell staining, or cytological examination, may be performed by a number of methods and suitable reagents known to the art including but not limited to: EA stains, hematoxylin stains, cytostain, Papanicolaou stain, eosin, Nissl stain, toluidine blue, silver stain, azocarmine stain, neutral red, or Janus green. In some cases the cells are fixed and/or permeabilized with for example methanol, ethanol, glutaraldehyde or formaldehyde prior to or during the staining procedure. In some cases, the cells are not fixed. In some cases, more than one stain is used in combination. In other cases no stain is used at all. In some cases measurement of nucleic acid content is performed using a staining procedure, for example with ethidium bromide, hematoxylin, Nissl stain or any nucleic acid stain known to the art.

In some embodiments of the present invention, cells may be smeared onto a slide by standard methods well known in the art for cytological examination. In other cases, liquid based cytology (LBC) methods may be utilized. In some cases, LBC methods provide for an improved means of cytology slide preparation, more homogeneous samples, increased sensitivity and specificity, and improved efficiency of handling of samples. In liquid based cytology methods, biological samples are transferred from the subject to a container or vial containing a liquid cytology preparation solution such as for example Cytyc ThinPrep, SurePath, or Monoprep or any other liquid based cytology preparation solution known in the art. Additionally, the sample may be rinsed from the collection device with liquid cytology preparation solution into the container or vial to ensure substantially quantitative transfer of the sample. The solution containing the biological sample in liquid based cytology preparation solution may then be stored and/or processed by a machine or by one skilled in the art to produce a layer of cells on a glass slide. The sample may further be stained and examined under the microscope in the same way as a conventional cytological preparation.

In some embodiments of the present invention, samples may be analyzed by immuno-histochemical staining. Immuno-histochemical staining provides for the analysis of the presence, location, and distribution of specific molecules or antigens by use of antibodies in a biological sample (e.g. cells or tissues). Antigens may be small molecules, proteins, peptides, nucleic acids or any other molecule capable of being specifically recognized by an antibody. Samples may be analyzed by immuno-histochemical methods with or without a prior fixing and/or permeabilization step. In some cases, the antigen of interest may be detected by contacting the sample with an antibody specific for the antigen and then non-specific binding may be removed by one or more washes. The specifically bound antibodies may then be detected by an antibody detection reagent such as for example a labeled secondary antibody, or a labeled avidin/streptavidin. In some cases, the antigen specific antibody may be labeled directly instead. Suitable labels for immuno-histochemistry include but are not limited to fluorophores such as fluorescein and rhodamine, enzymes such as alkaline phosphatase and horseradish peroxidase, and radionuclides such as ³²P and ¹²⁵I. Gene product markers that may be detected by immuno-histochemical staining include but are not limited to Her2/Neu, Ras, Rho, EGFR, VEGFR, UbcH10, RET/PTC1, cytokeratin 20, calcitonin, GAL-3, thyroid peroxidase, and thyroglobulin.

VII. Assay Results

The results of routine cytological or other assays may indicate a sample as negative (cancer, disease or condition free), ambiguous or suspicious (suggestive of the presence of a cancer, disease or condition), diagnostic (positive diagnosis for a cancer, disease or condition), or non diagnostic (providing inadequate information concerning the presence or absence of cancer, disease, or condition). The diagnostic results may be further classified as malignant or benign. The diagnostic results may also provide a score indicating for example, the severity or grade of a cancer, or the likelihood of an accurate diagnosis, such as via a p-value, a corrected p-value, or a statistical confidence indicator. In some cases, the diagnostic results may be indicative of a particular type of a cancer, disease, or condition, such as for example follicular adenoma, Hurthle cell adenoma, lymphocytic thyroiditis, hyperplasia, follicular carcinoma, follicular variant of papillary thyroid carcinoma, papillary carcinoma, or any of the diseases or conditions provided herein. In some cases, the diagnostic results may be indicative of a particular stage of a cancer, disease, or condition. The diagnostic results may inform a particular treatment or therapeutic intervention for the type or stage of the specific cancer, disease or condition diagnosed. In some embodiments, the results of the assays performed may be entered into a database. The molecular profiling company may bill the individual, insurance provider, medical provider, or government entity for one or more of the following: assays performed, consulting services, reporting of results, database access, or data analysis. In some cases all or some steps other than molecular profiling are performed by a cytological laboratory or a medical professional.

VIII. Molecular Profiling

Cytological assays mark the current diagnostic standard for many types of suspected tumors including for example thyroid tumors or nodules. In some embodiments of the present invention, samples that assay as negative, indeterminate, diagnostic, or non diagnostic may be subjected to subsequent assays to obtain more information. In the present invention, these subsequent assays comprise the steps of molecular profiling of genomic DNA, RNA, mRNA expression product levels, miRNA levels, gene expression product levels or gene expression product alternative splicing. In some embodiments of the present invention, molecular profiling means the determination of the number (e.g. copy number) and/or type of genomic DNA in a biological sample. In some cases, the number and/or type may further be compared to a control sample or a sample considered normal. In some embodiments, genomic DNA can be analyzed for copy number variation, such as an increase (amplification) or decrease in copy number, or variants, such as insertions, deletions, truncations and the like. Molecular profiling may be performed on the same sample, a portion of the same sample, or a new sample may be acquired using any of the methods described herein. The molecular profiling company may request additional sample by directly contacting the individual or through an intermediary such as a physician, third party testing center or laboratory, or a medical professional. In some cases, samples are assayed using methods and compositions of the molecular profiling business in combination with some or all cytological staining or other diagnostic methods. In other cases, samples are directly assayed using the methods and compositions of the molecular profiling business without the previous use of routine cytological staining or other diagnostic methods. In some cases the results of molecular profiling alone or in combination with cytology or other assays may enable those skilled in the art to diagnose or suggest treatment for the subject. In some cases, molecular profiling may be used alone or in combination with cytology to monitor tumors or suspected tumors over time for malignant changes.

The molecular profiling methods of the present invention provide for extracting and analyzing protein or nucleic acid (RNA or DNA) from one or more biological samples from a subject. In some cases, nucleic acid is extracted from the entire sample obtained. In other cases, nucleic acid is extracted from a portion of the sample obtained. In some cases, the portion of the sample not subjected to nucleic acid extraction may be analyzed by cytological examination or immuno-histochemistry. Methods for RNA or DNA extraction from biological samples are well known in the art and include for example the use of a commercial kit, such as the Qiagen DNeasy Blood and Tissue Kit, or the Qiagen EZ1 RNA Universal Tissue Kit.

(i) Tissue-Type Fingerprinting

In many cases, biological samples such as those provided by the methods of the present invention may contain several cell types or tissues, including but not limited to thyroid follicular cells, thyroid medullary cells, blood cells (RBCs, WBCs, platelets), smooth muscle cells, ducts, duct cells, basement membrane, lumen, lobules, fatty tissue, skin cells, epithelial cells, and infiltrating macrophages and lymphocytes. In the case of thyroid samples, diagnostic classification of the biological samples may involve for example primarily follicular cells (for cancers derived from the follicular cell such as papillary carcinoma, follicular carcinoma, and anaplastic thyroid carcinoma) and medullary cells (for medullary cancer). The diagnosis of indeterminate biological samples from thyroid biopsies in some cases concerns the distinction of follicular adenoma vs. follicular carcinoma. The molecular profiling signal of a follicular cell for example may thus be diluted out and possibly confounded by other cell types present in the sample. Similarly, diagnosis of biological samples from other tissues or organs often involves diagnosing one or more cell types among the many that may be present in the sample.

In some embodiments, the methods of the present invention provide for an upfront method of determining the cellular make-up of a particular biological sample so that the resulting molecular profiling signatures can be calibrated against the dilution effect due to the presence of other cell and/or tissue types. In one aspect, this upfront method is an algorithm that uses a combination of known cell and/or tissue specific gene expression patterns as an upfront mini-classifier for each component of the sample. This algorithm utilizes this molecular fingerprint to pre-classify the samples according to their composition and then applies a correction/normalization factor. This data may in some cases then feed into a final classification algorithm which would incorporate that information to aid in the final diagnosis.
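As a rough illustration of the pre-classification and correction idea, the Python sketch below scores each cell type by the mean expression of a marker set, converts the scores to fractions, and rescales a signal by the follicular fraction. Everything here is an assumption made for illustration: the marker identifiers are placeholders rather than genes named in this document, and the simple division used as a correction factor stands in for whatever correction/normalization the algorithm actually applies.

```python
from statistics import mean

# Hypothetical marker sets; identifiers are placeholders, not genes from the source.
CELL_TYPE_MARKERS = {
    "follicular": ["fol_marker_1", "fol_marker_2"],
    "lymphocyte": ["lym_marker_1", "lym_marker_2"],
    "blood": ["blood_marker_1"],
}


def estimate_composition(expression: dict) -> dict:
    """Score each cell type by mean marker expression and convert to fractions."""
    scores = {
        cell_type: mean([expression.get(gene, 0.0) for gene in genes])
        for cell_type, genes in CELL_TYPE_MARKERS.items()
    }
    total = sum(scores.values())
    if total == 0:
        raise ValueError("no marker signal detected")
    return {cell_type: score / total for cell_type, score in scores.items()}


def correct_for_dilution(signal: float, fraction: float,
                         min_fraction: float = 0.05) -> float:
    """Rescale a signal by the estimated fraction of the cell type of interest."""
    if fraction < min_fraction:
        raise ValueError("cell type fraction too low for a stable correction")
    return signal / fraction
```

The composition estimate could then be passed, together with the corrected signals, to a downstream classifier. The minimum-fraction guard is an added design choice that avoids amplifying noise when very few cells of interest are present.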

(ii) Genomic Analysis

In some embodiments, genomic sequence analysis, or genotyping, may be performed on the sample. This genotyping may take the form of mutational analysis such as single nucleotide polymorphism (SNP) analysis, insertion deletion polymorphism (InDel) analysis, variable number of tandem repeat (VNTR) analysis, copy number variation (CNV) analysis or partial or whole genome sequencing. Methods for performing genomic analyses are known to the art and may include high throughput sequencing such as but not limited to those methods described in U.S. Pat. Nos. 7,335,762; 7,323,305; 7,264,929; 7,244,559; 7,211,390; 7,361,488; 7,300,788; and 7,280,922. Methods for performing genomic analyses may also include microarray methods as described hereinafter. In some cases, genomic analysis may be performed in combination with any of the other methods herein. For example, a sample may be obtained, tested for adequacy, and divided into aliquots. One or more aliquots may then be used for cytological analysis of the present invention, one or more may be used for RNA expression profiling methods of the present invention, and one or more can be used for genomic analysis. It is further understood that the present invention anticipates that one skilled in the art may wish to perform other analyses on the biological sample that are not explicitly provided herein.

(iii) Expression Product Profiling

Gene expression profiling is the measurement of the activity (the expression) of thousands of genes at once, to create a global picture of cellular function. These profiles can, for example, distinguish between cells that are actively dividing, or show how the cells react to a particular treatment. Many experiments of this sort measure an entire genome simultaneously, that is, every gene present in a particular cell. Microarray technology measures the relative activity of previously identified target genes. Sequence based techniques, like serial analysis of gene expression (SAGE, SuperSAGE) are also used for gene expression profiling. SuperSAGE is especially accurate and can measure any active gene, not just a predefined set. In an RNA, mRNA or gene expression profiling microarray, the expression levels of thousands of genes are simultaneously monitored to study the effects of certain treatments, diseases, and developmental stages on gene expression. For example, microarray-based gene expression profiling can be used to characterize gene signatures of a genetic disorder disclosed herein, or different cancer types, subtypes of a cancer, and/or cancer stages.

Expression profiling experiments often involve measuring the relative amount of gene expression products, such as mRNA, expressed in two or more experimental conditions. This is because altered levels of a specific sequence of a gene expression product suggest a changed need for the protein coded for by the gene expression product, perhaps indicating a homeostatic response or a pathological condition. For example, if breast cancer cells express higher levels of mRNA associated with a particular transmembrane receptor than normal cells do, it might be that this receptor plays a role in breast cancer. One aspect of the present invention encompasses gene expression profiling as part of an important diagnostic test for genetic disorders and cancers, particularly, thyroid cancer.
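A common way to express the relative amount of a gene expression product between two conditions is a log2 fold change. The short Python sketch below is a generic illustration rather than a calculation prescribed by this document; the pseudocount that guards against division by zero and its default value are assumptions.

```python
import math


def log2_fold_change(test: float, reference: float,
                     pseudocount: float = 1.0) -> float:
    """Return log2 of the test/reference expression ratio.

    The pseudocount (an assumption, not from the source) keeps the ratio
    defined when either expression value is zero.
    """
    return math.log2((test + pseudocount) / (reference + pseudocount))
```

A positive value indicates higher expression in the test condition (for example, tumor cells) than in the reference condition (for example, normal cells), and a value of zero indicates no change.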

In some embodiments, RNA samples with RIN≤5.0 are typically not used for multi-gene microarray analysis, and may instead be used only for single-gene RT-PCR and/or TaqMan assays. Microarray, RT-PCR and TaqMan assays are standard molecular techniques well known in the relevant art. TaqMan probe-based assays are widely used in real-time PCR including gene expression assays, DNA quantification and SNP genotyping.

In one embodiment, gene expression products related to cancer that are known to the art are profiled. Such gene expression products have been described and include but are not limited to the gene expression products detailed in U.S. Pat. Nos. 7,358,061; 7,319,011; 5,965,360; 6,436,642; and US patent applications 2003/0186248, 2005/0042222, 2003/0190602, 2005/0048533, 2005/0266443, 2006/0035244, 2006/083744, 2006/0088851, 2006/0105360, 2006/0127907, 2007/0020657, 2007/0037186, 2007/0065833, 2007/0161004, 2007/0238119, and 2008/0044824.

It is further anticipated that other gene expression products related to cancer may become known, and that the methods and compositions described herein may include such newly discovered gene expression products.

In some embodiments of the present invention, gene expression products are analyzed alternatively or additionally for characteristics other than expression level. For example, gene products may be analyzed for alternative splicing. Alternative splicing, also referred to as alternative exon usage, is the RNA splicing variation mechanism wherein the exons of a primary gene transcript, the pre-mRNA, are separated and reconnected (i.e. spliced) so as to produce alternative mRNA molecules from the same gene. In some cases, these linear combinations then undergo the process of translation where a specific and unique sequence of amino acids is specified by each of the alternative mRNA molecules from the same gene resulting in protein isoforms. Alternative splicing may include incorporating different exons or different sets of exons, retaining certain introns, or utilizing alternate splice donor and acceptor sites.

In some cases, markers or sets of markers may be identified that exhibit alternative splicing that is diagnostic for benign, malignant or normal samples. Additionally, alternative splicing markers may further provide a diagnosis for the specific type of thyroid cancer (e.g. papillary, follicular, medullary, or anaplastic). Alternative splicing markers diagnostic for malignancy known to the art include those listed in U.S. Pat. No. 6,436,642.

In some cases, expression of RNA expression products that do not encode proteins, such as miRNAs and siRNAs, may be assayed by the methods of the present invention. Differential expression of these RNA expression products may be indicative of benign, malignant or normal samples. Differential expression of these RNA expression products may further be indicative of the subtype of the benign sample (e.g. FA, NHP, LCT, BN, CN, HA) or malignant sample (e.g. FC, PTC, FVPTC, ATC, MTC). In some cases, differential expression of miRNAs, siRNAs, alternative splice RNA isoforms, mRNAs or any combination thereof may be assayed by the methods of the present invention.

In some embodiments, the current invention provides 16 panels of biomarkers, each panel being required to characterize, rule out, and diagnose pathology within the thyroid. The sixteen panels are:

1 Normal Thyroid (NML)

2 Lymphocytic, Autoimmune Thyroiditis (LCT)

3 Nodular Hyperplasia (NHP)

4 Follicular Thyroid Adenoma (FA)

5 Hurthle Cell Thyroid Adenoma (HC)

6 Parathyroid (non thyroid tissue)

7 Anaplastic Thyroid Carcinoma (ATC)

8 Follicular Thyroid Carcinoma (FC)

9 Hurthle Cell Thyroid Carcinoma (HC)

10 Papillary Thyroid Carcinoma (PTC)

11 Follicular Variant of Papillary Carcinoma (FVPTC)

12 Medullary Thyroid Carcinoma (MTC)

13 Renal Carcinoma metastasis to the Thyroid

14 Melanoma metastasis to the Thyroid

15 B cell Lymphoma metastasis to the Thyroid

16 Breast Carcinoma metastasis to the Thyroid

Each panel includes a set of biomarkers required to characterize, rule out, and diagnose a given pathology within the thyroid. Panels 1-6 describe benign pathology. Panels 7-16 describe malignant pathology.

The biological nature of the thyroid and each pathology found within it suggests that there is redundancy between the plurality of biomarkers in one panel versus the plurality of biomarkers in another panel. Mirroring each pathology subtype, each diagnostic panel is heterogeneous and semi-redundant with the biomarkers in another panel. Heterogeneity and redundancy reflect the biology of the tissues sampled in a given FNA and the differences in gene expression that distinguish each pathology subtype from one another.

In one aspect, the diagnostic value of the present invention lies in the comparison of i) one or more markers in one panel, versus ii) one or more markers in each additional panel. The utility of the invention is its higher diagnostic accuracy in FNA than presently possible by any other means.

In some embodiments, the biomarkers within each panel are interchangeable (modular). The plurality of biomarkers in all panels can be substituted, increased, reduced, or improved to accommodate the definition of new pathologic subtypes (e.g. new case reports of metastasis to the thyroid from other organs). The current invention describes the plurality of markers that define each of sixteen heterogeneous, semi-redundant, and distinct pathologies found in the thyroid. All sixteen panels are required to arrive at an accurate diagnosis, and any given panel alone does not have sufficient power to make a true diagnostic determination. In some embodiments, the biomarkers in each panel are interchanged with a suitable combination of biomarkers, such that the plurality of biomarkers in each panel still defines a given pathology subtype within the context of examining the plurality of biomarkers that define all other pathology subtypes.

Methods and compositions of the invention can have genes selected from 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 or 16 or more biomarker panels and can have from 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50 or more gene expression products from each biomarker panel, in any combination. In some embodiments, the set of genes combined gives a specificity or sensitivity of greater than 70%, 75%, 80%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 99.5%, or a positive predictive value or negative predictive value of at least 95%, 95.5%, 96%, 96.5%, 97%, 97.5%, 98%, 98.5%, 99%, 99.5% or more.
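For reference, sensitivity, specificity, positive predictive value, and negative predictive value follow from the counts of true and false positives and negatives by their standard definitions. The Python sketch below applies those definitions; the function name and the example counts are illustrative and are not results reported in this document.

```python
def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard performance measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # fraction of malignant samples detected
        "specificity": tn / (tn + fp),  # fraction of benign samples cleared
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }
```

With hypothetical counts of 90 true positives, 20 false positives, 80 true negatives, and 10 false negatives, the sensitivity is 0.90 and the specificity is 0.80.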

(1) In Vitro Methods of Determining Expression Product Levels

The general methods for determining gene expression product levels are known to the art and may include but are not limited to one or more of the following: additional cytological assays, assays for specific proteins or enzyme activities, assays for specific expression products including protein or RNA or specific RNA splice variants, in situ hybridization, whole or partial genome expression analysis, microarray hybridization assays, SAGE, enzyme-linked immunosorbent assays, mass-spectrometry, immuno-histochemistry, or blotting. Gene expression product levels may be normalized to an internal standard such as total mRNA or the expression level of a particular gene including but not limited to glyceraldehyde 3 phosphate dehydrogenase, or tubulin.

In some embodiments of the present invention, gene expression product markers and alternative splicing markers may be determined by microarray analysis using, for example, Affymetrix arrays, cDNA microarrays, oligonucleotide microarrays, spotted microarrays, or other microarray products from Biorad, Agilent, or Eppendorf. Microarrays provide particular advantages because they may contain a large number of genes or alternative splice variants that may be assayed in a single experiment. In some cases, the microarray device may contain the entire human genome or transcriptome or a substantial fraction thereof allowing a comprehensive evaluation of gene expression patterns, genomic sequence, or alternative splicing. Markers may be found using standard molecular biology and microarray analysis techniques as described in Sambrook, Molecular Cloning: A Laboratory Manual, 2001 and Baldi, P., and Hatfield, W. G., DNA Microarrays and Gene Expression, 2002.

Microarray analysis begins with extracting and purifying nucleic acid from a biological sample (e.g. a biopsy or fine needle aspirate) using methods known to the art. For expression and alternative splicing analysis it may be advantageous to extract and/or purify RNA from DNA. It may further be advantageous to extract and/or purify mRNA from other forms of RNA such as tRNA and rRNA.

Purified nucleic acid may further be labeled with a fluorescent, radionuclide, or chemical label such as biotin or digoxigenin, for example by reverse transcription, PCR, ligation, chemical reaction or other techniques. The labeling can be direct or indirect, and indirect labeling may further require a coupling stage. The coupling stage can occur before hybridization, for example, using aminoallyl-UTP and NHS amino-reactive dyes (like cyanine dyes) or after, for example, using biotin and labeled streptavidin. The modified nucleotides (e.g. at a 1 aaUTP:4 TTP ratio) are added enzymatically at a lower rate compared to normal nucleotides, typically resulting in one modified nucleotide every 60 bases (measured with a spectrophotometer). The aaDNA may then be purified with, for example, a column or a diafiltration device. The aminoallyl group is an amine group on a long linker attached to the nucleobase, which reacts with a reactive label (e.g. a fluorescent dye).

The labeled samples may then be mixed with a hybridization solution which may contain SDS, SSC, dextran sulfate, a blocking agent (such as COT1 DNA, salmon sperm DNA, calf thymus DNA, PolyA or PolyT), Denhardt's solution, formamide, or a combination thereof.

A hybridization probe is a fragment of DNA or RNA of variable length, which is used to detect in DNA or RNA samples the presence of nucleotide sequences (the DNA target) that are complementary to the sequence in the probe. The probe thereby hybridizes to single-stranded nucleic acid (DNA or RNA) whose base sequence allows probe-target base pairing due to complementarity between the probe and target. The labeled probe is first denatured (by heating or under alkaline conditions) into single DNA strands and then hybridized to the target DNA.

To detect hybridization of the probe to its target sequence, the probe is tagged (or labeled) with a molecular marker; commonly used markers are ³²P or digoxigenin, which is a non-radioactive antibody-based marker. DNA sequences or RNA transcripts that have moderate to high sequence similarity to the probe are then detected by visualizing the hybridized probe via autoradiography or other imaging techniques. Detection of sequences with moderate or high similarity depends on how stringent the applied hybridization conditions were: high stringency, such as high hybridization temperature and low salt in hybridization buffers, permits only hybridization between nucleic acid sequences that are highly similar, whereas low stringency, such as lower temperature and high salt, allows hybridization when the sequences are less similar. Hybridization probes used in DNA microarrays refer to DNA covalently attached to an inert surface, such as coated glass slides or gene chips, and to which a mobile cDNA target is hybridized.

This mix may then be denatured by heat or chemical means and added to a port in a microarray. The holes may then be sealed and the microarray hybridized, for example, in a hybridization oven, where the microarray is mixed by rotation, or in a mixer. After an overnight hybridization, non-specific binding may be washed off (e.g. with SDS and SSC). The microarray may then be dried and scanned in a special machine where a laser excites the dye and a detector measures its emission. The image may be overlaid with a template grid and the intensities of the features (several pixels make a feature) may be quantified.

Various kits can be used for the amplification of nucleic acid and probe generation of the subject methods. Examples of kits that can be used in the present invention include but are not limited to the NuGEN WT-Ovation FFPE kit, and a cDNA amplification kit with the NuGEN Exon Module and Frag/Label module. The NuGEN WT-Ovation™ FFPE System V2 is a whole transcriptome amplification system that enables conducting global gene expression analysis on the vast archives of small and degraded RNA derived from FFPE samples. The system comprises the reagents and a protocol required for amplification of as little as 50 ng of total FFPE RNA. The protocol can be used for qPCR, sample archiving, fragmentation, and labeling. The amplified cDNA can be fragmented and labeled in less than two hours for GeneChip® 3′ expression array analysis using NuGEN's FL-Ovation™ cDNA Biotin Module V2. For analysis using Affymetrix GeneChip® Exon and Gene ST arrays, the amplified cDNA can be used with the WT-Ovation Exon Module, then fragmented and labeled using the FL-Ovation™ cDNA Biotin Module V2. For analysis on Agilent arrays, the amplified cDNA can be fragmented and labeled using NuGEN's FL-Ovation™ cDNA Fluorescent Module. More information on the NuGEN WT-Ovation FFPE kit can be obtained at http://www.nugeninc.com/nugen/index.cfm/products/amplification-systems/wt-ovation-ffpe/.

In some embodiments, the Ambion WT-expression kit can be used. The Ambion WT-expression kit allows amplification of total RNA directly without a separate ribosomal RNA (rRNA) depletion step. With the Ambion® WT Expression Kit, samples as small as 50 ng of total RNA can be analyzed on Affymetrix® GeneChip® Human, Mouse, and Rat Exon and Gene 1.0 ST Arrays. In addition to the lower input RNA requirement and high concordance between the Affymetrix® method and TaqMan® real-time PCR data, the Ambion® WT Expression Kit provides a significant increase in sensitivity. For example, a greater number of probe sets detected above background can be obtained at the exon level with the Ambion® WT Expression Kit as a result of an increased signal-to-noise ratio. The Ambion WT-expression kit may be used in combination with an additional Affymetrix labeling kit.

In some embodiments, the AmpTec Trinucleotide Nano mRNA Amplification kit (6299-A15) can be used in the subject methods. The ExpressArt® TRinucleotide mRNA amplification Nano kit is suitable for a wide range of input total RNA, from 1 ng to 700 ng. According to the amount of input total RNA and the required yields of aRNA, it can be used for 1 round (input >300 ng total RNA) or 2 rounds (minimal input amount 1 ng total RNA), with aRNA yields in the range of >10 μg. AmpTec's proprietary TRinucleotide priming technology results in preferential amplification of mRNAs (independent of the universal eukaryotic 3′-poly(A)-sequence), combined with selection against rRNAs. More information on the AmpTec Trinucleotide Nano mRNA Amplification kit can be obtained at http://www.amp-tec.com/products.htm. This kit can be used in combination with a cDNA conversion kit and an Affymetrix labeling kit.

The raw data may then be normalized, for example, by subtracting the background intensity and then dividing the intensities so that either the total intensity of the features on each channel or the intensity of a reference gene is made equal, and then the t-value for all the intensities may be calculated. More sophisticated methods include z-ratio, loess and lowess regression, and RMA (robust multi-array average) for Affymetrix chips.
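The simple normalization described above can be sketched as follows. This is a minimal illustration and not the method of any particular scanner software: the function name `normalize_channel`, its parameters, and the choice to clip negative background-subtracted values to zero are assumptions made for the example.

```python
def normalize_channel(intensities, background, reference_index=None):
    """Background-subtract one channel, then rescale it.

    If reference_index is None, the channel is scaled so that its total
    intensity equals 1.0 (so every channel ends up with the same total).
    Otherwise it is scaled by the intensity of the reference feature.
    Clipping negative values to zero is an assumption of this sketch.
    """
    corrected = [max(i - b, 0.0) for i, b in zip(intensities, background)]
    if reference_index is None:
        divisor = sum(corrected)
    else:
        divisor = corrected[reference_index]
    if divisor <= 0:
        raise ValueError("no signal left to scale by after background subtraction")
    return [c / divisor for c in corrected]
```

For example, `normalize_channel([110, 210, 310], [10, 10, 10], reference_index=0)` expresses each feature relative to the first (reference) feature.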

(2) In Vivo Methods of Determining Gene Expression Product Levels

It is further anticipated that the methods and compositions of the present invention may be used to determine gene expression product levels in an individual without first obtaining a sample. For example, gene expression product levels may be determined in vivo, that is, in the individual. Methods for determining gene expression product levels in vivo are known to the art and include imaging techniques such as CAT, MRI, NMR, PET, and optical, fluorescence, or biophotonic imaging of protein or RNA levels using antibodies or molecular beacons. Such methods are described in US 2008/0044824 and US 2008/0131892, herein incorporated by reference. Additional methods for in vivo molecular profiling are contemplated to be within the scope of the present invention.

In some embodiments of the present invention, molecular profiling includes the step of binding the sample or a portion of the sample to one or more probes of the present invention. Suitable probes bind to components of the sample, i.e. gene products, that are to be measured and include but are not limited to antibodies or antibody fragments, aptamers, nucleic acids, and oligonucleotides. The binding of the sample to the probes of the present invention represents a transformation of matter from sample to sample bound to one or more probes. The method of diagnosing cancer based on molecular profiling further comprises the steps of detecting gene expression products (i.e. mRNA or protein) and their levels in the sample; comparing those levels to an amount in a normal control sample to determine the differential gene expression product level between the sample and the control; classifying the test sample by inputting one or more differential gene expression product levels to a trained algorithm of the present invention; validating the sample classification using the selection and classification algorithms of the present invention; and identifying the sample as positive for a genetic disorder or a type of cancer.

(i) Comparison of Sample to Normal

The results of the molecular profiling performed on the sample provided by the individual (test sample) may be compared to a biological sample that is known or suspected to be normal. A normal sample is one that is or is expected to be free of any cancer, disease, or condition, or a sample that would test negative for any cancer, disease, or condition in the molecular profiling assay. The normal sample may be from a different individual from the individual being tested, or from the same individual. In some cases, the normal sample is a sample obtained from a buccal swab of an individual, for example the individual being tested. The normal sample may be assayed at the same time as, or at a different time from, the test sample.

The results of an assay on the test sample may be compared to the results of the same assay on a normal sample. In some cases the results of the assay on the normal sample are from a database, or a reference. In some cases, the results of the assay on the normal sample are a known or generally accepted value by those skilled in the art. In some cases the comparison is qualitative. In other cases the comparison is quantitative. In some cases, qualitative or quantitative comparisons may involve but are not limited to one or more of the following: comparing fluorescence values, spot intensities, absorbance values, chemiluminescent signals, histograms, critical threshold values, statistical significance values, gene product expression levels, gene product expression level changes, alternative exon usage, changes in alternative exon usage, protein levels, DNA polymorphisms, copy number variations, indications of the presence or absence of one or more DNA markers or regions, or nucleic acid sequences.

(ii) Evaluation of Results

In some embodiments, the molecular profiling results are evaluated using methods known to the art for correlating gene product expression levels or alternative exon usage with specific phenotypes such as malignancy, the type of malignancy (e.g. follicular carcinoma), benignancy, or normalcy (e.g. disease or condition free). In some cases, a specified statistical confidence level may be determined in order to provide a diagnostic confidence level. For example, it may be determined that a confidence level of greater than 90% may be a useful predictor of malignancy, type of malignancy, or benignancy. In other embodiments, more or less stringent confidence levels may be chosen. For example, a confidence level of approximately 70%, 75%, 80%, 85%, 90%, 95%, 97.5%, 99%, 99.5%, or 99.9% may be chosen as a useful phenotypic predictor. The confidence level provided may in some cases be related to the quality of the sample, the quality of the data, the quality of the analysis, the specific methods used, and the number of gene expression products analyzed. The specified confidence level for providing a diagnosis may be chosen on the basis of the expected number of false positives or false negatives and/or cost. Methods for choosing parameters for achieving a specified confidence level or for identifying markers with diagnostic power include but are not limited to receiver operating characteristic (ROC) curve analysis, binormal ROC, principal component analysis, partial least squares analysis, singular value decomposition, least absolute shrinkage and selection operator analysis, least angle regression, and the threshold gradient directed regularization method.

(iii) Data Analysis

Raw gene expression level and alternative splicing data may in some cases be improved through the application of algorithms designed to normalize and/or improve the reliability of the data. In some embodiments of the present invention the data analysis requires a computer or other device, machine or apparatus for application of the various algorithms described herein due to the large number of individual data points that are processed. A “machine learning algorithm” refers to a computational-based prediction methodology, also known to persons skilled in the art as a “classifier”, employed for characterizing a gene expression profile. The signals corresponding to certain expression levels, which are obtained by, e.g., microarray-based hybridization assays, are typically subjected to the algorithm in order to classify the expression profile. Supervised learning generally involves “training” a classifier to recognize the distinctions among classes and then “testing” the accuracy of the classifier on an independent test set. For new, unknown samples the classifier can be used to predict the class to which the samples belong.

In some cases, the robust multi-array average (RMA) method may be used to normalize the raw data. The RMA method begins by computing background-corrected intensities for each matched cell on a number of microarrays. The background-corrected values are restricted to positive values as described by Irizarry et al. Biostatistics 2003 Apr; 4(2): 249-64. After background correction, the base-2 logarithm of each background-corrected matched-cell intensity is then obtained. The background-corrected, log-transformed, matched intensity on each microarray is then normalized using the quantile normalization method, in which for each input array and each probe expression value, the array percentile probe value is replaced with the average of all array percentile points. This method is more completely described by Bolstad et al. Bioinformatics 2003. Following quantile normalization, the normalized data may then be fit to a linear model to obtain an expression measure for each probe on each microarray. Tukey's median polish algorithm (Tukey, J. W., Exploratory Data Analysis. 1977) may then be used to determine the log-scale expression level for the normalized probe set data.
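The quantile normalization step of RMA can be illustrated with a short sketch. It covers only that one step (not background correction or median polish), assumes every array has the same number of probes, and breaks ties by probe position, which is a simplification; the function name `quantile_normalize` is hypothetical.

```python
def quantile_normalize(arrays):
    """Quantile-normalize equal-length lists of log2 intensities.

    `arrays` holds one list per microarray. Each value is replaced by the
    mean, taken across all arrays, of the values that share its rank.
    """
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must contain the same number of probes")

    # Sort each array, then average across arrays at each rank position.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [
        sum(s[r] for s in sorted_arrays) / len(arrays) for r in range(n_probes)
    ]

    normalized = []
    for a in arrays:
        # Indices of the probes in ascending order of intensity.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = rank_means[rank]
        normalized.append(out)
    return normalized
```

After this step every array shares the same empirical distribution of values, which is what makes probe intensities comparable across arrays before the model fit.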

Data may further be filtered to remove data that may be considered suspect. In some embodiments, data deriving from microarray probes that have fewer than about 4, 5, 6, 7 or 8 guanosine+cytosine nucleotides may be considered to be unreliable due to their aberrant hybridization propensity or secondary structure issues. Similarly, data deriving from microarray probes that have more than about 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, or 22 guanosine+cytosine nucleotides may be considered unreliable due to their aberrant hybridization propensity or secondary structure issues.
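A GC-content filter of this kind might look like the sketch below. The cutoffs `min_gc=6` and `max_gc=17` are picked arbitrarily from the ranges listed above for illustration only, and the function names are hypothetical.

```python
def gc_count(probe_sequence):
    """Count guanosine and cytosine nucleotides in a probe sequence."""
    seq = probe_sequence.upper()
    return seq.count("G") + seq.count("C")


def passes_gc_filter(probe_sequence, min_gc=6, max_gc=17):
    """Keep a probe only if its G+C count lies within [min_gc, max_gc].

    The default cutoffs are illustrative assumptions, not recommended values.
    """
    return min_gc <= gc_count(probe_sequence) <= max_gc
```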

In some cases, unreliable probe sets may be selected for exclusion from data analysis by ranking probe-set reliability against a series of reference datasets. For example, RefSeq or Ensembl (EMBL) are considered very high quality reference datasets. Data from probe sets matching RefSeq or Ensembl sequences may in some cases be specifically included in microarray analysis experiments due to their expected high reliability. Similarly, data from probe-sets matching less reliable reference datasets may be excluded from further analysis, or considered on a case by case basis for inclusion. In some cases, the Ensembl high throughput cDNA (HTC) and/or mRNA reference datasets may be used to determine the probe-set reliability separately or together. In other cases, probe-set reliability may be ranked. For example, probes and/or probe-sets that match perfectly to all reference datasets, such as for example RefSeq, HTC, and mRNA, may be ranked as most reliable (1). Furthermore, probes and/or probe-sets that match two out of three reference datasets may be ranked as next most reliable (2), probes and/or probe-sets that match one out of three reference datasets may be ranked next (3), and probes and/or probe-sets that match no reference datasets may be ranked last (4). Probes and/or probe-sets may then be included or excluded from analysis based on their ranking. For example, one may choose to include data from category 1, 2, 3, and 4 probe-sets; category 1, 2, and 3 probe-sets; category 1 and 2 probe-sets; or category 1 probe-sets for further analysis. In another example, probe-sets may be ranked by the number of base pair mismatches to reference dataset entries. It is understood that there are many methods known in the art for assessing the reliability of a given probe and/or probe-set for molecular profiling and the methods of the present invention encompass any of these methods and combinations thereof.
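The four-level ranking described above can be sketched as follows, assuming the matching of probe-sets against each reference dataset has already been done elsewhere. The probe-set identifiers in the usage note and the function names are hypothetical.

```python
REFERENCE_DATASETS = ("RefSeq", "HTC", "mRNA")


def reliability_rank(matched):
    """Return 1 (most reliable) to 4 (least reliable).

    `matched` is the set of reference dataset names that the probe-set
    matches perfectly: 3 matches -> 1, 2 -> 2, 1 -> 3, 0 -> 4.
    """
    hits = sum(1 for name in REFERENCE_DATASETS if name in matched)
    return len(REFERENCE_DATASETS) + 1 - hits


def select_probe_sets(probe_matches, max_rank):
    """Keep probe-sets whose reliability rank is at most max_rank."""
    return [
        probe_set
        for probe_set, matched in probe_matches.items()
        if reliability_rank(matched) <= max_rank
    ]
```

Calling `select_probe_sets(probe_matches, max_rank=2)` corresponds to the "category 1 and 2 probe-sets" choice in the text.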

In some embodiments of the present invention, data from probe-sets may be excluded from analysis if they are not expressed or expressed at an undetectable level (not above background). A probe-set is judged to be expressed above background if for any group:

\[ \int_{T_0}^{\infty} \varphi(x)\,dx < \text{Significance}\ (0.01) \]

where \(\varphi\) is the density of the standard normal distribution and:

\[ T_0 = \frac{\sqrt{\text{GroupSize}}\,(T - P)}{\sqrt{P_{\text{var}}}} \]

GroupSize = Number of CEL files in the group,
T = Average of probe scores in probe-set,
P = Average of Background probes averages of GC content, and
\( P_{\text{var}} \) = Sum of Background probe variances / (Number of probes in probe-set)^2.

This allows including probe-sets in which the average of probe scores in a group is greater than background, where the average expression of background probes with GC content similar to that of the probe-set probes serves as the center of background for the probe-set, and enables one to derive the probe-set dispersion from the background probe-set variance.
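The above-background test can be written out directly from the definitions of T0, T, P and Pvar. The sketch below is illustrative only: the function and parameter names are hypothetical, and the upper tail of the standard normal distribution is computed with the complementary error function from the Python standard library.

```python
from math import erfc, sqrt


def upper_tail_standard_normal(t0):
    """Integral of the standard normal density from t0 to infinity."""
    return 0.5 * erfc(t0 / sqrt(2.0))


def expressed_above_background(probe_scores, background_mean,
                               background_variances, group_size,
                               significance=0.01):
    """Apply the above-background test to one probe-set in one group.

    probe_scores         -- scores of the probes in the probe-set (gives T)
    background_mean      -- average of GC-matched background probe averages (P)
    background_variances -- variances of those background probes (for Pvar)
    group_size           -- number of CEL files in the group
    """
    t = sum(probe_scores) / len(probe_scores)
    pvar = sum(background_variances) / len(probe_scores) ** 2
    t0 = sqrt(group_size) * (t - background_mean) / sqrt(pvar)
    return upper_tail_standard_normal(t0) < significance
```

Because the text says "for any group", a caller would evaluate this once per group and keep the probe-set if at least one group passes.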

In some embodiments of the present invention, probe-sets that exhibit no or low variance may be excluded from further analysis. Low-variance probe-sets are excluded from the analysis via a Chi-Square test. A probe-set is considered to be low-variance if its transformed variance is to the left of the 99 percent confidence interval of the Chi-Squared distribution with (N−1) degrees of freedom.

\[ \frac{(N-1)\cdot \text{Probe-set Variance}}{\text{Gene Probe-set Variance}} \sim \chi^2(N-1) \]

where N is the number of input CEL files, (N−1) is the degrees of freedom for the Chi-Squared distribution, and the Gene Probe-set Variance (the 'probe-set variance for the gene') is the average of probe-set variances across the gene.
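A rough sketch of the low-variance filter follows. Two assumptions are made here and are not taken from the text: "to the left of the 99 percent confidence interval" is read as falling below the lower 1% quantile of the Chi-Squared distribution, and that quantile is approximated with the Wilson-Hilferty formula so the example needs only the standard library. An exact quantile function could be substituted.

```python
from statistics import NormalDist


def chi2_quantile_approx(p, dof):
    """Approximate the p-quantile of a chi-squared distribution.

    Uses the Wilson-Hilferty approximation; this is not an exact value.
    """
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * c ** 0.5) ** 3


def is_low_variance(probe_set_variance, gene_variance, n_cel_files,
                    lower_tail=0.01):
    """Flag a probe-set whose transformed variance is in the lower tail.

    The statistic (N-1) * probe-set variance / gene probe-set variance is
    compared against the lower_tail quantile of chi-squared with N-1
    degrees of freedom. The 1% default is an assumption of this sketch.
    """
    dof = n_cel_files - 1
    statistic = dof * probe_set_variance / gene_variance
    return statistic < chi2_quantile_approx(lower_tail, dof)
```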

In some embodiments of the present invention, probe-sets for a given gene or transcript cluster may be excluded from further analysis if they contain fewer than a minimum number of probes that pass through the previously described filter steps for GC content, reliability, variance and the like. For example, in some embodiments, probe-sets for a given gene or transcript cluster may be excluded from further analysis if they contain fewer than about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, or fewer than about 20 probes.

Methods of data analysis of gene expression levels or of alternative splicing may further include the use of a feature selection algorithm as provided herein. In some embodiments of the present invention, feature selection is provided by use of the LIMMA software package (Smyth, G. K. (2005). Limma: linear models for microarray data. In: Bioinformatics and Computational Biology Solutions using R and Bioconductor, R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, W. Huber (eds.), Springer, New York, pages 397-420).

Methods of data analysis of gene expression levels and/or of alternative splicing may further include the use of a pre-classifier algorithm. For example, an algorithm may use a cell-specific molecular fingerprint to pre-classify the samples according to their composition and then apply a correction/normalization factor. This data/information may then be fed into a final classification algorithm, which would incorporate that information to aid in the final diagnosis.

Methods of data analysis of gene expression levels and/or of alternative splicing may further include the use of a classifier algorithm as provided herein. In some embodiments of the present invention a support vector machine (SVM) algorithm, a random forest algorithm, or a combination thereof is provided for classification of microarray data. In some embodiments, identified markers that distinguish samples (e.g. benign vs. malignant, normal vs. malignant) or distinguish subtypes (e.g. PTC vs. FVPTC) are selected based on statistical significance. In some cases, the statistical significance selection is performed after applying a Benjamini-Hochberg correction for false discovery rate (FDR).
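The Benjamini-Hochberg correction mentioned above can be sketched in a few lines. This is a generic implementation of the standard step-up procedure that returns adjusted p-values; the function name is hypothetical and the 0.05 cutoff in the usage note is only an example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted
```

Markers could then be selected with, for example, `[i for i, q in enumerate(benjamini_hochberg(p_values)) if q <= 0.05]`.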

In some cases, the classifier algorithm may be supplemented with a meta-analysis approach such as that described by Fishel and Kaufman et al. 2007 Bioinformatics 23(13): 1599-606. In some cases, the classifier algorithm may be supplemented with a meta-analysis approach such as a repeatability analysis. In some cases, the repeatability analysis selects markers that appear in at least one predictive expression product marker set.
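One simple reading of a repeatability analysis is to count how many predictive marker sets each marker appears in and keep those that reach a minimum count. The sketch below follows that reading; the function name and the `min_sets` parameter are assumptions, with the default of 1 matching the "at least one" wording above.

```python
from collections import Counter


def repeatable_markers(marker_sets, min_sets=1):
    """Return markers found in at least min_sets of the given marker sets."""
    # set(s) ensures a marker is counted once per marker set.
    counts = Counter(marker for s in marker_sets for marker in set(s))
    return sorted(m for m, n in counts.items() if n >= min_sets)
```

Raising `min_sets` makes the selection stricter, keeping only markers that recur across several independently derived marker sets.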

In some cases, the results of feature selection and classification may be ranked using a Bayesian post-analysis method. For example, microarray data may be extracted, normalized, and summarized using methods known in the art such as the methods provided herein. The data may then be subjected to a feature selection step such as any feature selection methods known in the art such as the methods provided herein including but not limited to the feature selection methods provided in LIMMA. The data may then be subjected to a classification step such as any of the classification methods known in the art such as the use of any of the algorithms or methods provided herein including but not limited to the use of SVM or random forest algorithms. The results of the classifier algorithm may then be ranked according to a posterior probability function. For example, the posterior probability function may be derived from examining known molecular profiling results, such as published results, to derive prior probabilities from type I and type II error rates of assigning a marker to a category (e.g. benign, malignant, normal, ATC, PTC, MTC, FC, FN, FA, FVPTC, CN, HA, HC, LCT, NHP etc.). These error rates may be calculated based on reported sample size for each study using an estimated fold change value (e.g. 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2, 2.2, 2.4, 2.5, 3, 4, 5, 6, 7, 8, 9, 10 or more). These prior probabilities may then be combined with a molecular profiling dataset of the present invention to estimate the posterior probability of differential gene expression. Finally, the posterior probability estimates may be combined with a second dataset of the present invention to formulate the final posterior probabilities of differential expression. Additional methods for deriving and applying posterior probabilities to the analysis of microarray data are known in the art and have been described for example in Smyth, G. K. 2004 Stat. Appl. Genet. Mol. Biol. 3: Article 3. In some cases, the posterior probabilities may be used to rank the markers provided by the classifier algorithm. In some cases, markers may be ranked according to their posterior probabilities and those that pass a chosen threshold may be chosen as markers whose differential expression is indicative of or diagnostic for samples that are for example benign, malignant, normal, ATC, PTC, MTC, FC, FN, FA, FVPTC, CN, HA, HC, LCT, or NHP. Illustrative threshold values include posterior probabilities of 0.7, 0.75, 0.8, 0.85, 0.9, 0.925, 0.95, 0.975, 0.98, 0.985, 0.99, 0.995 or higher.
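The two-stage updating and thresholding described above can be illustrated with a generic Bayesian sketch. It is not the specific posterior probability function of the text: here each dataset is assumed to contribute a likelihood ratio (evidence for versus against differential expression) for each marker, and the marker names, function names and threshold are hypothetical.

```python
def update_probability(prior, likelihood_ratio):
    """Bayes update on the odds scale: posterior odds = prior odds * LR.

    `prior` must be strictly between 0 and 1.
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


def rank_markers(priors, lr_first, lr_second, threshold=0.9):
    """Rank markers by final posterior probability and apply a threshold.

    priors    -- marker -> prior probability of differential expression
    lr_first  -- marker -> likelihood ratio from the first dataset
    lr_second -- marker -> likelihood ratio from the second dataset
    """
    final = {}
    for marker, prior in priors.items():
        intermediate = update_probability(prior, lr_first[marker])
        final[marker] = update_probability(intermediate, lr_second[marker])
    ranked = sorted(final.items(), key=lambda item: item[1], reverse=True)
    return [(marker, p) for marker, p in ranked if p >= threshold]
```

The posterior from the first dataset becomes the prior for the second, which mirrors the sequential combination described in the paragraph.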

A statistical evaluation of the results of the molecular profiling may provide a quantitative value or values indicative of one or more of the following: the likelihood of diagnostic accuracy, the likelihood of cancer, disease or condition, the likelihood of a particular cancer, disease or condition, or the likelihood of the success of a particular therapeutic intervention. Thus a physician, who is not likely to be trained in genetics or molecular biology, need not understand the raw data. Rather, the data is presented directly to the physician in its most useful form to guide patient care. The results of the molecular profiling can be statistically evaluated using a number of methods known to the art including, but not limited to: Student's t-test, the two-sided t-test, Pearson rank sum analysis, hidden Markov model analysis, analysis of q-q plots, principal component analysis, one-way ANOVA, two-way ANOVA, LIMMA and the like.

In some embodiments of the present invention, the use of molecular profiling alone or in combination with cytological analysis may provide a diagnosis that is between about 85% accurate and about 99% or about 100% accurate. In some cases, the molecular profiling business may through the use of molecular profiling and/or cytology provide a diagnosis of malignant, benign, or normal that is about 85%, 86%, 87%, 88%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 97.5%, 98%, 98.5%, 99%, 99.5%, 99.75%, 99.8%, 99.85%, or 99.9% accurate.

In some cases, accuracy may be determined by tracking the subject over time to determine the accuracy of the original diagnosis. In other cases, accuracy may be established in a deterministic manner or using statistical methods. For example, receiver operating characteristic (ROC) analysis may be used to determine the optimal assay parameters to achieve a specific level of accuracy, specificity, positive predictive value, negative predictive value, and/or false discovery rate. Methods for using ROC analysis in cancer diagnosis are known in the art and have been described for example in US Patent Application No. 2006/019615, herein incorporated by reference in its entirety.
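Choosing an assay cutoff from ROC analysis can be sketched as follows. The example assumes each sample has a classifier score (higher meaning more likely malignant) and a known label of 1 (malignant) or 0 (benign); the function names and the rule "maximize sensitivity subject to a minimum specificity" are illustrative assumptions.

```python
def roc_points(scores, labels):
    """Return (threshold, sensitivity, specificity) for each distinct score.

    A sample is called positive when its score is >= the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        points.append((threshold, tp / positives, 1 - fp / negatives))
    return points


def threshold_for_specificity(scores, labels, min_specificity):
    """Pick the most sensitive threshold meeting the specificity target."""
    candidates = [p for p in roc_points(scores, labels) if p[2] >= min_specificity]
    return max(candidates, key=lambda p: p[1]) if candidates else None
```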

In some embodiments of the present invention, gene expression products and compositions of nucleotides encoding for such products which are determined to exhibit the greatest difference in expression level or the greatest difference in alternative splicing between benign and normal, benign and malignant, or malignant and normal may be chosen for use as molecular profiling reagents of the present invention. Such gene expression products may be particularly useful by providing a wider dynamic range, greater signal to noise, improved diagnostic power, lower likelihood of false positives or false negatives, or a greater statistical confidence level than other methods known or used in the art.

In other embodiments of the present invention, the use of molecular profiling alone or in combination with cytological analysis may reduce the number of samples scored as non-diagnostic by about 100%, 99%, 95%, 90%, 80%, 75%, 70%, 65%, or about 60% when compared to the use of standard cytological techniques known to the art. In some cases, the methods of the present invention may reduce the number of samples scored as intermediate or suspicious by about 100%, 99%, 98%, 97%, 95%, 90%, 85%, 80%, 75%, 70%, 65%, or about 60%, when compared to the standard cytological methods used in the art.

In some cases the results of the molecular profiling assays are entered into a database for access by representatives or agents of the molecular profiling business, the individual, a medical provider, or insurance provider. In some cases assay results include interpretation or diagnosis by a representative, agent or consultant of the business, such as a medical professional. In other cases, a computer or algorithmic analysis of the data is provided automatically. In some cases the molecular profiling business may bill the individual, insurance provider, medical provider, researcher, or government entity for one or more of the following: molecular profiling assays performed, consulting services, data analysis, reporting of results, or database access.

In some embodiments of the present invention, the results of the molecular profiling are presented as a report on a computer screen or as a paper record. In some cases, the report may include, but is not limited to, such information as one or more of the following: the number of genes differentially expressed, the suitability of the original sample, the number of genes showing differential alternative splicing, a diagnosis, a statistical confidence for the diagnosis, the likelihood of cancer or malignancy, and indicated therapies.

(iv) Categorization of Samples Based on Molecular Profiling Results

The results of the molecular profiling may be classified into one of the following: benign (free of a cancer, disease, or condition), malignant (positive diagnosis for a cancer, disease, or condition), or non-diagnostic (providing inadequate information concerning the presence or absence of a cancer, disease, or condition). In some cases, a diagnostic result may further classify the type of cancer, disease or condition. In other cases, a diagnostic result may indicate a certain molecular pathway involved in the cancer, disease or condition, or a certain grade or stage of a particular cancer, disease or condition. In still other cases a diagnostic result may inform an appropriate therapeutic intervention, such as a specific drug regimen like a kinase inhibitor such as Gleevec or any drug known to the art, or a surgical intervention like a thyroidectomy or a hemithyroidectomy.

In some embodiments of the present invention, results are classified using a trained algorithm. Trained algorithms of the present invention include algorithms that have been developed using a reference set of known malignant, benign, and normal samples including but not limited to the samples listed in FIG. 1. Algorithms suitable for categorization of samples include but are not limited to k-nearest neighbor algorithms, concept vector algorithms, naive Bayesian algorithms, neural network algorithms, hidden Markov model algorithms, genetic algorithms, and mutual information feature selection algorithms or any combination thereof. In some cases, trained algorithms of the present invention may incorporate data other than gene expression or alternative splicing data such as but not limited to DNA polymorphism data, sequencing data, scoring or diagnosis by cytologists or pathologists of the present invention, information provided by the pre-classifier algorithm of the present invention, or information about the medical history of the subject of the present invention.
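As an illustration of one of the listed algorithm families, a k-nearest neighbor classifier over expression profiles can be sketched as below. It is a toy example under stated assumptions: profiles are equal-length numeric vectors, Euclidean distance is used, and the function name, labels and `k=3` default are hypothetical.

```python
from collections import Counter
from math import dist


def knn_classify(train_profiles, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training profiles."""
    # Sort the labeled reference profiles by Euclidean distance to the query.
    neighbors = sorted(
        zip(train_profiles, train_labels),
        key=lambda pair: dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

In this picture, "training" amounts to storing the reference set of known malignant, benign, and normal profiles, and each new sample is assigned the most common label among its closest references.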

(v) Monitoring of Subjects or Therapeutic Interventions Via Molecular Profiling

In some embodiments, a subject may be monitored using methods and compositions of the present invention. For example, a subject may be diagnosed with cancer or a genetic disorder. This initial diagnosis may or may not involve the use of molecular profiling. The subject may be prescribed a therapeutic intervention such as a thyroidectomy for a subject suspected of having thyroid cancer. The results of the therapeutic intervention may be monitored on an ongoing basis by molecular profiling to detect the efficacy of the therapeutic intervention. In another example, a subject may be diagnosed with a benign tumor or a precancerous lesion or nodule, and the tumor, nodule, or lesion may be monitored on an ongoing basis by molecular profiling to detect any changes in the state of the tumor or lesion.

Molecular profiling may also be used to ascertain the potential efficacy of a specific therapeutic intervention prior to administering it to a subject. For example, a subject may be diagnosed with cancer. Molecular profiling may indicate the upregulation of a gene expression product known to be involved in cancer malignancy, such as for example the RAS oncogene. A tumor sample may be obtained and cultured in vitro using methods known to the art. Various inhibitors of the aberrantly activated or dysregulated pathway, or drugs known to inhibit the activity of the pathway, may then be tested against the tumor cell line for growth inhibition. Molecular profiling may also be used to monitor the effect of these inhibitors on, for example, downstream targets of the implicated pathway.

(vi) Molecular Profiling as a Research Tool

In some embodiments, molecular profiling may be used as a research tool to identify new markers for diagnosis of suspected tumors; to monitor the effect of drugs or candidate drugs on biological samples such as tumor cells, cell lines, tissues, or organisms; or to uncover new pathways for oncogenesis and/or tumor suppression.

(vii) Biomarker Groupings Based on Molecular Profiling

Thyroid genes are described according to the groups 1) Benign vs. Malignant, 2) alternative gene splicing, 3) KEGG Pathways, 4) Normal Thyroid, 5) Thyroid pathology subtype, 6) Gene Ontology, and 7) Biomarkers of metastasis to the thyroid from non-thyroid organs. Methods and compositions of the invention can have genes selected from one or more of the groups listed above and/or 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50 or more subgroups from any of the groups listed above (e.g. one or more different KEGG pathways) and can have from 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50 or more gene expression products from each group, in any combination. In some embodiments, the set of genes combined gives a specificity or sensitivity of greater than 70%, 75%, 80%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 99.5%, or a positive predictive value or negative predictive value of at least 95%, 95.5%, 96%, 96.5%, 97%, 97.5%, 98%, 98.5%, 99%, 99.5% or more.
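The performance measures named above follow directly from the counts of true and false calls on samples of known status. The sketch below uses the standard definitions; the function name and the example counts in the usage note are hypothetical.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute standard diagnostic performance measures from counts.

    tp/fn -- malignant samples called malignant / benign
    tn/fp -- benign samples called benign / malignant
    """
    return {
        "sensitivity": tp / (tp + fn),  # fraction of malignant samples detected
        "specificity": tn / (tn + fp),  # fraction of benign samples cleared
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }
```

For instance, `diagnostic_metrics(tp=90, fp=5, tn=95, fn=10)` describes a hypothetical gene set with 90% sensitivity and 95% specificity on 100 malignant and 100 benign samples.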

In some embodiments, the extracellular matrix, adherens, focal adhesion, and tight junction genes are used as biomarkers of thyroid cancer. In some embodiments, the signaling pathway is selected from one of the following three pathways: adherens pathway, focal adhesion pathway, and tight junction pathway. In some embodiments, at least one gene is selected from one of the 3 pathways. In some embodiments, at least one gene is selected from each one of the three pathways. In some embodiments, at least one gene is selected from two of the three pathways. In some embodiments, at least one gene that is involved in all three pathways is selected. In one example, a set of genes that is involved in adherens pathway, focal adhesion pathway, and tight junction pathway is selected as the markers for diagnosis of a cancer such as thyroid cancer.

The follicular cells that line thyroid follicles are highly polarized and organized in structure, requiring distinct roles of their luminal and apical cell membranes. In some embodiments, cytoskeleton, plasma membrane, and extracellular space genes are used as biomarkers of thyroid cancer. In some embodiments, genes that overlap all four pathways, i.e. ECM, focal adhesion, adherens, and tight junction pathways, are used as biomarkers of thyroid cancer. In one example, the present invention provides the Benign vs. malignant group (n=948) as a thyroid classification gene list. This list has been grouped according to alternative splicing, KEGG pathways, and gene ontology. KEGG pathways are further described in Table 1.

In some embodiments, the present invention provides a method of diagnosing cancer comprising gene expression products from one or more signaling pathways that include but are not limited to the following: acute myeloid leukemia signaling, somatostatin receptor 2 signaling, cAMP-mediated signaling, cell cycle and DNA damage checkpoint signaling, G-protein coupled receptor signaling, integrin signaling, melanoma cell signaling, relaxin signaling, and thyroid cancer signaling. Methods and compositions of the invention can have genes selected from 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50 or more signaling pathways and can have from 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50 or more gene expression products from each signaling pathway, in any combination. In some embodiments, the set of genes combined gives a specificity or sensitivity of greater than 70%, 75%, 80%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 99.5%, or a positive predictive value or negative predictive value of at least 95%, 95.5%, 96%, 96.5%, 97%, 97.5%, 98%, 98.5%, 99%, 99.5% or more.

In some embodiments, the present invention provides a method of diagnosing cancer comprising gene expression products from one or more ontology groups that include but are not limited to the following: cell aging, cell cortex, cell cycle, cell death/apoptosis, cell differentiation, cell division, cell junction, cell migration, cell morphogenesis, cell motion, cell projection, cell proliferation, cell recognition, cell soma, cell surface, cell surface linked receptor signal transduction, cell adhesion, transcription, immune response, or inflammation. Methods and compositions of the invention can have genes selected from 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50 or more ontology groups and can have from 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50 or more gene expression products from each ontology group, in any combination. In some embodiments, the set of genes combined gives a specificity or sensitivity of greater than 70%, 75%, 80%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 99.5%, or a positive predictive value or negative predictive value of at least 95%, 95.5%, 96%, 96.5%, 97%, 97.5%, 98%, 98.5%, 99%, 99.5% or more.

TABLE 1 Genes involved in the KEGG Pathways

KEGG Pathway | % in Top 948 B vs. M list | Genes in Top 948 B vs. M list | Total Genes in Pathway
ECM | 23 | 18 | 84
p53 | 14 | 10 | 69
PPAR | 14 | 10 | 69
Thyroid Cancer | 14 | 4 | 29
Focal Adhesion | 13 | 26 | 201
Adherens | 12 | 9 | 77
Tight Junction | 11 | 14 | 134
Pathways in Cancer Overview | 10 | 33 | 332
Jak/STAT | 10 | 14 | 155
Cell Cycle | 7 | 9 | 129
TGFbeta | 7 | 6 | 87
Wnt | 7 | 10 | 151
ErbB | 6 | 5 | 87
Apoptosis | 6 | 5 | 88
MAPK | 5 | 14 | 269
Autoimmune Thyroid | 4 | 2 | 53
mTOR | 2 | 1 | 53
VEGF | 1 | 1 | 76
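The percentage column of Table 1 appears to be the number of pathway genes found in the top 948 benign vs. malignant list divided by the total number of genes in the pathway, rounded to a whole percent. That relationship is inferred from the tabulated values rather than stated in the text, so the following Python sketch is an assumption; the function name is hypothetical, and only five rows of Table 1 are reproduced.

```python
def percent_in_list(genes_in_list: int, total_in_pathway: int) -> int:
    """Share of a pathway's genes present in the biomarker list, in whole percent."""
    return round(100 * genes_in_list / total_in_pathway)


# (genes in top 948 list, total genes in pathway), copied from five rows of Table 1.
rows = {
    "p53": (10, 69),
    "Thyroid Cancer": (4, 29),
    "MAPK": (14, 269),
    "mTOR": (1, 53),
    "VEGF": (1, 76),
}

percentages = {name: percent_in_list(*counts) for name, counts in rows.items()}
for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```

For these five rows the computed values agree with the percentage column of Table 1.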

Top biomarkers of benign vs. malignant thyroid, n=948, are listed below in List 1:

List 1

TCID-2406391, TCID-3153400, TCID-3749600, ABCC3, ABCD2, ABTB2, ACBD7,ACSL1, ACTA2, ADAMTS5, ADAMTS9, ADK, ADORA1, AEBP1, AFAP1, AGR2, AHNAK2,AHR, AIDA, AIM2, AK1, AKR1C3, ALAS2, ALDH1A3, ALDH1B1, ALDH6A1, ALOX5,AMIGO2, AMOT, ANGPTL1, ANK2, ANKS6, ANO5, ANXA1, ANXA2, ANXA2P1, ANXA3,ANXA6, AOAH, AP3S1, APOBEC3F, APOBEC3G, APOL1, APOO, AQP4, AQP9,ARHGAP19, ARHGAP24, ARL13B, ARL4A, ARMCX3, ARMCX6, ARNTL, ARSG, ASAP2,ATIC, ATM, ATP13A4, ATP6V0D2, ATP8A1, AUTS2, AVPR1A, B3GNT3, BAG3, BCL2,BCL2A1, BCL9, BHLHE40, BHLHE41, BIRC5, BLNK, BMP1, BMP8A, BTBD11, BTG3,C10orf131, C10orf72, C11orf72, C11orf74, C11orf80, C12orf35, C12orf49,C14orf45, C16orf45, C17orf87, C19orf33, C1orf115, C1orf116, C2, C22orf9,C2orf40, C3, C4A, C4B, C4orf34, C4orf7, C5orf28, C6orf168, C6orf174,C7orf62, C8orf16, C8orf39, C8orf4, C8orf79, C9orf68, CA11, CADM1, CALCA,CAMK2N1, CAMK4, CAND1, CARD16, CARD17, CARD5, CASC5, CASP1, CAV1, CAV2,CCDC109B, CCDC121, CCDC146, CCDC148, CCDC152, CCDC80, CCL13, CCL19,CCND1, CCND2, CD151, CD180, CD2, CD200, CD36, CD3D, CD48, CD52, CD69,CD79A, CD96, CDCP1, CDH11, CDH3, CDH6, CDK2, CDKL2, CDO1, CDON, CDR1,CEP110, CEP55, CERKL, CFB, CFH, CFHR1, CFI, CHAF1B, CHD4, CHGB, CHI3L1,CITED1, CKB, CKS2, CLC, CLDN1, CLDN10, CLDN16, CLDN4, CLDN7, CLEC2B,CLEC4E, CLIP3, CLU, CMAH, CNN2, CNN3, COL12A1, COL1A1, COPZ2, CP, CPE,CPNE3, CR2, CRABP1, CRABP2, CSF3R, CSGALNACT1, CST6, CTNNAL1, CTNNB1,CTSC, CTSH, CTTN, CWH43, CXCL1, CXCL11, CXCL13, CXCL14, CXCL17, CXCL2,CXCL3, CXCL9, CXorf18, CXorf27, CYP1B1, CYP24A1, CYP27A1, CYP4B1,CYSLTR1, CYSLTR2, CYTH1, DAPK2, DCAF17, DCBLD2, DCUN1D3, DDAH1, DDB2,DDX52, DENND4A, DGKH, DGKI, DHRS1, DHRS3, DIO1, DIRAS3, DLC1, DLG2,DLG4, DLGAP5, DNAJB14, DNASE1L3, DOCK8, DOCK9, DOK4, DPH3B, DPP4, DPYD,DPYSL3, DSG2, DSP, DST, DUOX1, DUOX2, DUOXA1, DUOXA2, DUSP4, DUSP5,DUSP6, DYNC1I2, DYNLT1, DZIP1, ECE1, EDNRB, EFEMP1, EGF, EGFR, EHBP1,EHD2, EHF, EIF2B2, EIF4H, ELK3, ELMO1, EMP2, EMR3, ENAH, ENDOD1, ENTPD1,EPB41, EPDR1, EPHA4, EPHX4, EPR1, EPS8, ERBB2, ERBB3, 
ERI2, ERO1LB, ERP27, ESRRG, ETNK2, ETS1, ETV1, ETV4, ETV5, F2RL2, F8, FAAH2, FABP4, FAM111A, FAM111B, FAM164A, FAM176A, FAM20A, FAM55C, FAM82B, FAM84B, FAT4, FBLN5, FBXO2, FBXO21, FCN1, FCN2, FGF2, FGFR1OP2, FIBIN, FLJ20184, FLJ26056, FLJ32810, FLJ42258, FLRT3, FN1, FPR1, FPR2, FREM2, FRMD3, FXYD6, FYB, FZD4, FZD6, FZD7, G0S2, GABBR2, GABRB2, GADD45A, GALE, GALNT12, GALNT3, GALNT7, GBE1, GBP1, GBP3, GBP5, GGCT, GIMAP2, GIMAP5, GIMAP7, GJA4, GLA, GLDC, GLDN, GLIS3, GNG12, GOLT1A, GPAM, GPR110, GPR125, GPR155, GPR174, GPR98, GPRC5B, GRAMD3, GSN, GTF3A, GULP1, GYPB, GYPC, GYPE, GZMA, GZMK, HEMGN, HEY2, HIGD1A, HIPK2, HIST1H1A, HIST1H3B, HIST1H4L, HK1, HLA-DPB1, HLA-DQB2, HLF, HMGA2, HMMR, HNRNPM, HPN, HPS3, HRASLS, HSD17B6, HSPH1, ICAM1, ID3, IFI16, IFITM1, IFNAR2, IGF2BP2, IGFBP5, IGFBP6, IGFBP7, IGJ, IGK, IGKC, IGKV1-5, IGKV3-15, IGKV3-20, IGKV3D-11, IGKV3D-15, IGSF1, IKZF2, IKZF3, IKZF4, IL1RAP, IL1RL1, IL2RA, IL7R, IL8, IL8RA, IL8RB, IL8RBP, IMPDH2, INPP5F, IPCEF1, IQGAP2, ISYNA1, ITGA2, ITGA3, ITGA4, ITGA9, ITGB1, ITGB4, ITGB6, ITGB8, ITM2A, ITPR1, IYD, JAK2, JUB, KAL1, KATNAL2, KBTBD8, KCNA3, KCNAB1, KCNK5, KCNQ3, KCTD14, KDELC1, KDELR3, KHDRBS2, KIAA0284, KIAA0408, KIAA1217, KIAA1305, KIF11, KIT, KLF8, KLHDC8A, KLHL6, KLK10, KLK7, KLRB1, KLRC4, KLRG1, KLRK1, KRT18, KRT19, KYNU, LAMB1, LAMB3, LAMC1, LAMC2, LCA5, LCMT1, LCN2, LCP1, LDOC1, LEMD1, LGALS2, LGALS3, LIFR, LILRA1, LILRB1, LIMA1, LINGO2, LIPH, LMO3, LMO4, LOC100124692, LOC100127974, LOC100129112, LOC100129115, LOC100129171, LOC100129961, LOC100130100, LOC100130248, LOC100131102, LOC100131490, LOC100131869, LOC100131938, LOC100131993, LOC100132338, LOC100132764, LOC26080, LOC283508, LOC284861, LOC439911, LOC440434, LOC440871, LOC554202, LOC643454, LOC646358, LOC648149, LOC650405, LOC652493, LOC652694, LOC653264, LOC653354, LOC653498, LOC728212, LOC729461, LOC730031, LONRF2, LOX, LPAR1, LPAR5, LPCAT2, LPL, LRP1B, LRP2, LRRC69, LRRN1, LRRN3, LTBP2, LTBP3, LUM, LYPLA1, LYRM1, LYZ, MACC1, MAFG, MAGOH2, MAMLD1, MAP2, MAPK4, MAPK6, MATN2,
MBOAT2,MCM4, MCM7, MDK, ME1, MED13, MED13L, MELK, MET, METTL7B, MEX3C, MFGE8,MGAM, MGAT1, MGAT4C, MGC2889, MGST1, MIS12, MKI67, MLLT3, MLLT4, MMP16,MNDA, MORC4, MPPED2, MPZL2, MRC2, MRPL14, MT1F, MT1G, MT1H, MT1M, MT1P2,MT1P3, MTHFD1L, MTIF3, MUC1, MUC15, MVP, MXRA5, MYEF2, MYH10, MYO1B,MYO1D, MYO5A, MYO6, NAB2, NAE1, NAG20, NAV2, NCAM1, NCKAP1, ND1, NDC80,NDFIP2, NEB, NEDD4L, NELL2, NEXN, NFATC3, NFE2, NFIB, NFKBIZ, NIPAL3,NIPSNAP3A, NIPSNAP3B, NOD1, NPAS3, NPAT, NPC2, NPEPPS, NPL, NPY1R,NRCAM, NRIP1, NRP2, NT5E, NTAN1, NUCB2, NUDT6, NUPR1, NUSAP1, OCIAD2,OCR1, ODZ1, ORAOV1, OSBPL1A, OSGEP, OSMR, P2RY13, P4HA2, PAM, PAPSS2,PARD6B, PARP14, PARP4, PARVA, PBX1, PCDH1, PCMTD1, PCNXL2, PDE5A, PDE9A,PDGFRL, PDK4, PDLIM1, PDLIM4, PDZRN4, PEG10, PERP, PGCP, PHEX, PHF16,PHLDB2, PHYHIP, PIAS3, PIGN, PKHD1L1, PKP2, PKP4, PLA2G16, PLA2G7,PLA2R1, PLAG1, PLAU, PLCD3, PLCL1, PLEK, PLEKHA4, PLEKHA5, PLEKHF2,PLK2, PLP2, PLS3, PLSCR4, PLXNC1, PMEPA1, POLR2J4, PON2, POR, POU2F3,PPAP2C, PPARGC1A, PPBP, PPL, PPP1R14C, PRCP, PRICKLE1, PRINS, PRMT6,PROK2, PROS1, PRR15, PRRG1, PRSS23, PSAT1, PSD3, PTK7, PTPN14, PTPN22,PTPRC, PTPRE, PTPRF, PTPRG, PTPRK, PTPRU, PTRF, PXDNL, PYGL, PYHIN1,QTRT1, RAB25, RAB27A, RAB32, RAB34, RAD23B, RAG2, RAI2, RAPGEF5, RARG,RASA1, RASD2, RBBP7, RBBP8, RBMS2, RCBTB2, RCE1, RDH5, RG9MTD2, RGS13,RGS18, RGS2, RHOBTB3, RHOH, RHOU, RICH2, RIMS2, RNASE1, RNASET2, RND3,ROS1, RPL39L, RPL9P11, RPRD1A, RPS6KA6, RRAS, RRAS2, RRBP1, RRM2, RUNX1,RUNX2, RXRG, RYR2, S100A12, S100A14, S100A16, S100A8, S100A9, SALL1,SAV1, SC4MOL, SCARA3, SCARNA11, SCEL, SCG3, SCG5, SCNN1A, SCP2, SCRN1,SDC4, SDK1, SEH1L, SEL1L3, SELL, SEMA3C, SEMA3D, SEMA4C, SEPP1, SEPT11,SERGEF, SERINC2, SERPINA1, SERPINA2, SERPINE2, SERPING1, SFN, SFTPB,SGCB, SGCE, SGEF, SGMS2, SGPP2, SH2D4A, SH3BGR, SH3PXD2A, SIPA1L2,SIRPA, SIRPB1, SLA, SLC12A2, SLC16A4, SLC16A6, SLC17A5, SLC24A5,SLC25A33, SLC26A4, SLC26A7, SLC27A2, SLC27A6, SLC34A2, SLC35D2, SLC35F2,SLC39A6, SLC4A4, SLC5A8, SLC7A11, SLC7A2, SLIT1, SLIT2, 
SLPI, SMAD9,SMOC2, SMURF2, SNCA, SNX1, SNX22, SNX7, SOAT1, SORBS2, SP140, SP140L,SPATS2, SPATS2L, SPC25, SPINT1, SPOCK1, SPP1, SPRED2, SPRY1, SPRY2,SQLE, SRL, SSPN, ST20, ST3GAL5, STAT4, STEAP2, STK17B, STK32A, STXBP6,SULF1, SYNE1, SYT14, SYTL5, TACSTD2, TASP1, TBC1D3F, TC2N, TCERG1L,TCF7L2, TCFL5, TDRKH, TEAD1, TFCP2L1, TFF3, TFPI, TGFA, TGFB2, TGFBR1,THSD4, TIAM2, TIMP1, TIMP3, TIPARP, TJP1, TJP2, TLCD1, TLE4, TLR10,TLR8, TM4SF1, TM4SF4, TM7SF4, TMEM100, TMEM117, TMEM133, TMEM156,TMEM163, TMEM171, TMEM215, TMEM220, TMEM90A, TMEM98, TMPRSS4, TMSB10,TMSB15A, TMSB15B, TNC, TNFAIP8, TNFRSF11B, TNFRSF12A, TNFRSF17, TNFSF10,TNFSF15, TOMM34, TOX, TPD52L1, TPO, TPX2, TRIP10, TRPC5, TRPC6, TSC22D1,TSHZ2, TSPAN13, TSPAN6, TSPAN8, TSSC1, TTC39A, TUBB1, TUBB6, TULP3,TUSC3, TXNL1, TXNRD1, TYMS, UCHL5, VAMP1, VNN1, VNN2, VNN3, WDR40A,WDR54, WDR72, WIPI1, WNT5A, XKRX, XPR1, YIF1B, YIPF1, YTHDC2, ZBTB33,ZCCHC12, ZCCHC16, ZEB2, ZFP36L1, ZFPM2, ZMAT3, ZMAT4, ZNF143, ZNF208,ZNF487, ZNF643, ZNF804B, ZYG11A.

Alternatively spliced genes, n=283, are listed below in List 2:

List 2

ABCC3, ADAMTS5, ADAMTS9, AIDA, AK1, AKR1C3, ALDH1A3, ALDH6A1, AMIGO2,AMOT, ANGPTL1, ANKS6, ANO5, ANXA1, ANXA2, ANXA2P1, ANXA3, AQP4,ARHGAP24, ARL4A, ARMCX3, ARMCX6, ARSG, ATIC, ATP13A4, ATP8A1, AUTS2,BAG3, BCL2, BCL9, BHLHE41, C10orf131, C11orf74, C14orf45, C16orf45,C19orf33, C2orf40, C3, C5orf28, C8orf79, CA11, CALCA, CAV1, CCND1,CCND2, CD36, CD36, CDH3, CDH6, CDON, CFH, CFHR1, CHD4, CITED1, CLDN16,CLU, COPZ2, CP, CRABP1, CSGALNACT1, CTSC, CTSH, CTTN, CWH43, CYSLTR2,DCBLD2, DCUN1D3, DDB2, DGKH, DGKI, DIO1, DLG2, DOCK9, DPH3B, DPP4, DSP,DST, DUSP6, EFEMP1, EIF2B2, ELMO1, EMP2, ENAH, ENTPD1, EPHX4, ERBB3,ERI2, ERO1LB, ETNK2, ETV1, ETV5, F8, FABP4, FAM111B, FAM20A, FAM55C,FAT4, FBLN5, FGFR1OP2, FLJ42258, FLRT3, FN1, FREM2, FXYD6, GABBR2,GABRB2, GALNT7, GBE1, GBP1, GBP3, GGCT, GIMAP7, GPAM, GPR125, GPR155,GRAMD3, GSN, HLF, HMGA2, HSPH1, IMPDH2, IQGAP2, ITGA2, ITGA3, ITGA9,ITGB6, ITGB8, ITM2A, ITPR1, IYD, KATNAL2, KCNA3, KCNQ3, KDELC1, KHDRBS2,KIAA0284, KIAA1217, KIT, KLF8, KLK10, KRT19, LAMB3, LAMC2, LEMD1, LIFR,LINGO2, LMO3, LOC100127974, LOC100129112, LOC100131490, LOC100131869,LOC283508, LOC648149, LOC653354, LONRF2, LPCAT2, LPL, LRP1B, LRP2,LRRC69, LRRN1, LRRN3, LYRM1, MACC1, MAFG, MAP2, MAPK4, MAPK6, MATN2,MED13, MET, METTL7B, MFGE8, MLLT3, MPPED2, MPZL2, MRPL14, MT1F, MT1G,MT1H, MT1P2, MTHFD1L, MUC1, MVP, MYEF2, MYH10, MYO1D, NAG20, NAV2, NEB,NEDD4L, NELL2, NFATC3, NFKBIZ, NPC2, NRCAM, NUCB2, ORAOV1, P4HA2, PAM,PAPSS2, PARVA, PDLIM4, PEG10, PGCP, PIGN, PKHD1L1, PLA2G16, PLA2G7,PLA2R1, PLAU, PLEKHA4, PLP2, PLSCR4, PLXNC1, PMEPA1, PON2, PPARGC1A,PRINS, PROS1, PSD3, PTPRK, PYHIN1, QTRT1, RAB27A, RAB34, RAD23B, RASA1,RHOBTB3, RNASET2, RPS6KA6, RUNX1, SCARNA11, SCG5, SDC4, SERPINA1,SERPINA2, SGEF, SH2D4A, SLA, SLC12A2, SLC24A5, SLC26A4, SLC26A7,SLC27A2, SLC27A6, SLC35F2, SLC4A4, SLC5A8, SLC7A2, SOAT1, SPATS2,SPATS2L, SPINT1, SPP1, SSPN, STK32A, SULF1, SYNE1, TCFL5, TFPI, TGFBR1,TIPARP, TJP1, TLE4, TM7SF4, TMEM171, TMEM90A, TNFAIP8, TNFRSF11B,TOMM34, TPD52L1, TPO, 
TSC22D1, TUSC3, TYMS, WDR54, WDR72, WIPI1, XPR1, YIF1B, ZFPM2, ZMAT4.

Genes involved in the KEGG pathways are listed below in Table 6: there are 18 pathways with a total of n=109 unique genes.

TABLE 6

Signaling Pathway | Number of Genes | Genes
ECM Pathway | 19 | CD36, COL1A1, FN1, HMMR, ITGA2, ITGA3, ITGA4, ITGA9, ITGB1, ITGB4, ITGB6, ITGB8, LAMB1, LAMB3, LAMC1, LAMC2, SDC4, SPP1, TNC
p53 Pathway | 10 | ATM, CCND1, CCND2, CDK2, DDB2, GADD45A, PERP, RRM2, SFN, ZMAT3
PPAR Pathway | 10 | ACSL1, CD36, CYP27A1, FABP4, LPL, ME1, RXRG, SCP2, SLC27A2, SLC27A6
Thyroid Cancer Pathway | 4 | CCND1, CTNNB1, RXRG, TCF7L2
Focal Adhesion Pathway | 26 | BCL2, CAV1, CAV2, CCND1, CCND2, COL1A1, CTNNB1, EGF, EGFR, ERBB2, FN1, ITGA2, ITGA3, ITGA4, ITGA9, ITGB1, ITGB6, ITGB8, LAMB1, LAMB3, LAMC1, LAMC2, MET, PARVA, SPP1, TNC
Adherens Pathway | 9 | CTNNB1, EGFR, ERBB2, MET, MLLT4, PTPRF, TCF7L2, TGFBR1, TJP1
Tight Junctions Pathway | 15 | CLDN1, CLDN10, CLDN16, CLDN4, CLDN7, CTNNB1, CTTN, EPB41, MLLT4, MYH10, PARD6B, RRAS, RRAS2, TJP1, TJP2
Pathways in Cancer Overview | 34 | BCL2, BIRC5, CCND1, CDK2, CSF3R, CTNNB1, DAPK2, EGF, EGFR, ERBB2, ETS1, FGF2, FN1, FZD4, FZD6, FZD7, IL8, ITGA2, ITGA3, ITGB1, KIT, LAMB1, LAMB3, LAMC1, LAMC2, MET, PIAS3, RUNX1, RXRG, TCF7L2, TGFA, TGFB2, TGFBR1, WNT5A
Jak/STAT Pathway | 16 | CCND1, CCND2, CSF3R, IFNAR2, IL2RA, IL7R, ITGB4, JAK2, LIFR, OSMR, PIAS3, SPRED2, SPRY1, SPRY2, STAT4, TPO
Cell Cycle Pathway | 9 | ATM, CCND1, CCND2, CDK2, GADD45A, MCM4, MCM7, SFN, TGFB2
TGFbeta Pathway | 6 | BMP8A, ID3, SMAD9, SMURF2, TGFB2, TGFBR1
Wnt Pathway | 10 | CCND1, CCND2, CTNNB1, FZD4, FZD6, FZD7, NFATC3, PRICKLE1, TCF7L2, WNT5A
ErbB Pathway | 5 | EGF, EGFR, ERBB2, ERBB3, TGFA
Apoptosis Pathway | 5 | ATM, BCL2, ENDOD1, IL1RAP, TNFSF10
MAPK Pathway | 14 | DUSP4, DUSP5, DUSP6, EGF, EGFR, FGF2, GADD45A, GNG12, RASA1, RPS6KA6, RRAS, RRAS2, TGFB2, TGFBR1
Autoimmune Thyroid Pathway | 2 | HLA-DPB1, TPO
mTOR Pathway | 1 | RPS6KA6
VEGF Pathway | 1 | NFATC3
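Because a gene can belong to several pathways, the per-pathway counts in Table 6 add up to more than the number of unique genes. The following Python sketch shows that distinction on five of the eighteen rows only; the dictionary keys are shortened pathway labels and the function name is hypothetical. It does not reproduce the full n=109 figure, which would require all eighteen rows.

```python
# Gene sets copied from five rows of Table 6 (subset only, for illustration).
pathways = {
    "Thyroid Cancer": {"CCND1", "CTNNB1", "RXRG", "TCF7L2"},
    "Wnt": {"CCND1", "CCND2", "CTNNB1", "FZD4", "FZD6", "FZD7",
            "NFATC3", "PRICKLE1", "TCF7L2", "WNT5A"},
    "ErbB": {"EGF", "EGFR", "ERBB2", "ERBB3", "TGFA"},
    "mTOR": {"RPS6KA6"},
    "VEGF": {"NFATC3"},
}


def unique_genes(gene_sets: dict[str, set[str]]) -> set[str]:
    """Union of all pathway gene sets, so each gene is counted once."""
    return set().union(*gene_sets.values())


memberships = sum(len(genes) for genes in pathways.values())
unique = unique_genes(pathways)
# Genes that appear in more than one of the selected pathways.
shared = sorted(
    gene for gene in unique
    if sum(gene in genes for genes in pathways.values()) > 1
)

print(memberships, len(unique), shared)
```

In this subset there are 21 pathway memberships but 17 unique genes, because CCND1, CTNNB1, NFATC3 and TCF7L2 each appear in two of the selected pathways.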

Top genes separating benign and malignant thyroid (combined) from normal thyroid, n=55, are listed below in List 3:

List 3

ANGPTL1, ANXA3, C10orf131, C2orf40, C7orf62, CAV1, CCDC80, CDR1, CFH, CFHR1, CLDN16, CP, CRABP1, EFEMP1, ENTPD1, FABP4, FBLN5, FN1, GBP1, GBP3, GULP1, HSD17B6, IPCEF1, KIT, LRP1B, LRRC69, LUM, MAPK6, MATN2, MPPED2, MT1F, MT1G, MT1H, MT1M, MT1P2, MT1P3, MYEF2, NRCAM, ODZ1, PAPSS2, PKHD1L1, PLA2R1, RYR2, SEMA3D, SLC24A5, SLC26A4, SLC26A7, SLIT2, TFPI, TMEM171, TPO, TSPAN8, YTHDC2, ZFPM2, ZNF804B.

Thyroid surgical pathology subtypes, n=873, are listed below:

(i) List 4: FA Subtype, n=243

TCID-3124344, AHR, ALOX5, ANGPTL1, ANXA2, ANXA2P1, APOL1, AVPR1A, BMP8A, BTBD11, C2, C3, C8orf39, CCDC109B, CD36, CDON, CFB, CHGB, CHI3L1, CKB, CLDN1, CP, CRABP1, CTSC, CTSH, CXCL1, CXCL2, CXCL3, CXorf27, CYP1B1, DLG2, DNASE1L3, DPP4, DUOX1, DUOX2, DYNLT1, EIF4H, F8, FABP4, FAM20A, FAM55C, FBLN5, FLJ26056, FXYD6, G0S2, GALNT7, GLIS3, GPAM, HIGD1A, HK1, HLF, HSD17B6, ICAM1, IGFBP7, IL1RAP, IPCEF1, IYD, KATNAL2, KCNAB1, KHDRBS2, KLF8, KLHDC8A, LAMB1, LGALS3, LOC100131869, LOC26080, LOC284861, LOC439911, LOC653264, LOC728212, LOC729461, LPCAT2, LRRC69, MAGOH2, MAPK4, MAPK6, MELK, MPPED2, MT1G, NEB, NFKBIZ, NRIP1, PARP14, PKHD1L1, PLA2G7, PLP2, PLXNC1, POR, PRMT6, PROS1, PSMB2, PTPRE, PYGL, RNASE1, RNASET2, RPL9P11, RRAS2, RRBP1, RUNX1, RUNX2, RYR2, SCP2, SEL1L3, SERGEF, SGPP2, SH3BGR, SLC25A33, SLC26A4, SLC26A7, SLC27A6, SLC4A4, SLPI, SORBS2, SQLE, STK32A, SYTL5, TFCP2L1, TIAM2, TIMP3, TMEM220, TMSB10, TRPC6, TSHZ2, TSSC1, VAMP1, ZNF487, ABCC3, C11orf72, C8orf79, CLDN16, CLU, CST6, CYSLTR2, DIO1, DPH3B, ERO1LB, FN1, GABRB2, IGFBP6, IKZF3, KIT, KRT19, LIFR, LIPH, MACC1, MAFG, MPZL2, MT1F, MT1H, MT1P2, NELL2, ODZ1, RAG2, ROS1, SERPINA1, SERPINA2, SLC34A2, TCFL5, TIMP1, TPO, ZMAT4, ADAMTS9, ALDH1B1, ALDH6A1, ANO5, APOO, C10orf72, C11orf74, C14orf45, C2orf40, C4A, C4B, C5orf28, C6orf174, CAMK2N1, CCDC121, CCND1, CDH3, CITED1, COPZ2, CPNE3, CRABP2, CSGALNACT1, DAPK2, DLC1, ECE1, EIF2B2, EMP2, ERBB2, FAM82B, FIBIN, FLJ42258, FRMD3, HEY2, HRASLS, ID3, IGF2BP2, IGSF1, IKZF2, ITGA9, KIAA0408, KIAA1305, LMO3, MATN2, MDK, MET, METTL7B, MFGE8, MGC2889, MIS12, NAV2, NCAM1, NIPSNAP3A, NIPSNAP3B, NOD1, NTAN1, NUCB2, NUPR1, PCMTD1, PIGN, PLAG1, PSAT1, PXDNL, QTRT1, RG9MTD2, RXRG, SDC4, SLC35D2, SLC7A11, SMAD9, SPRY1, STEAP2, TASP1, TCF7L2, TMEM171, TNFRSF11B, TNFRSF12A, TRPC5, TXNL1, WDR72, YIPF1, ZCCHC12, ZCCHC16.

(ii) List 5: FC Subtype, n=102

TCID-3124344, ABCC3, ANGPTL1, AVPR1A, C8orf39, CD2, CD36, CD48, CD52, CKB, CLDN1, CLDN16, CRABP1, CXCL9, DIO1, DLG2, DNASE1L3, DPH3B, DYNLT1, EIF4H, ERO1LB, F8, FABP4, FBLN5, FLJ26056, FXYD6, FYB, GLIS3, GULP1, GZMA, GZMK, HK1, HLA-DPB1, IFITM1, IGFBP7, IGJ, IGK@, IGKC, IGKV1-5, IGKV3-15, IGKV3-20, IGKV3D-11, IGKV3D-15, IPCEF1, KHDRBS2, KLHDC8A, KLRC4, KLRK1, LAMB1, LCP1, LIFR, LOC100130100, LOC100131869, LOC26080, LOC284861, LOC439911, LOC440871, LOC650405, LOC652493, LOC652694, LOC653264, LOC728212, LOC729461, LYZ, MAGOH2, MAPK4, MT1F, MT1H, MT1P2, NEB, ODZ1, PLA2G7, POR, PRMT6, PSMB2, PTPRC, RAG2, RNASE1, RNASET2, RPL9P11, RRAS2, RRBP1, RYR2, SCP2, SERGEF, SGPP2, SH3BGR, SLC25A33, SLC26A4, SQLE, STK32A, TCFL5, TFCP2L1, TIAM2, TIMP3, TMEM220, TPO, TRPC6, TSSC1, VAMP1, ZFPM2, ZNF487.

(iii) List 6: LCT Subtype, n=140

ADAMTS9, AIM2, APOBEC3F, APOBEC3G, ARHGAP19, ATP13A4, BAG3, BCL2A1, BIRC5, BLNK, C10orf72, C11orf72, C12orf35, C4orf7, C6orf168, CALCA, CARD17, CARD8, CASP1, CCL19, CCND1, CD180, CD2, CD3D, CD48, CD52, CD79A, CD96, CEP110, CHGB, CLDN16, CLEC2B, CNN2, COL12A1, CR2, CXCL13, CXCL9, CYTH1, DENND4A, DNAJB14, DOCK8, DPYD, DUOX1, DUOX2, DUOXA1, DUOXA2, DUSP6, DYNC1I2, EGF, EPDR1, EPR1, EPS8, ETS1, FLJ42258, FYB, GABBR2, GABRB2, GALNT7, GBP5, GIMAP2, GIMAP5, GIMAP7, GPR155, GPR174, GTF3A, GZMA, GZMK, HIST1H3B, HIST1H4L, HLA-DPB1, HNRNPM, IFI16, IFITM1, IFNAR2, IGF2BP2, IGJ, IGK@, IGKC, IGKV1-5, IGKV3-15, IGKV3-20, IGKV3D-11, IGKV3D-15, IKZF3, IL7R, ITM2A, JAK2, KBTBD8, KLHL6, KLRC4, KLRG1, KLRK1, KYNU, LCP1, LIPH, LOC100130100, LOC100131490, LOC440871, LOC646358, LOC650405, LOC652493, LOC652694, LONRF2, LYZ, MED13L, METTL7B, MPZL2, MTIF3, NAV2, ND1, NFATC3, ODZ1, PAPSS2, PROS1, PSD3, PTPRC, PYGL, PYHIN1, RAD23B, RGS13, RIMS2, RRM2, SCG3, SLIT1, SP140, SP140L, SPC25, ST20, ST3GAL5, STAT4, STK32A, TC2N, TLE4, TNFAIP8, TNFRSF17, TNFSF10, TOX, UCHL5, ZEB2, ZNF143.

(iv) List 7: FVPTC Subtype, n=182

ABCC3, ADAMTS9, AIDA, ALDH1B1, ALDH6A1, ANK2, ANO5, APOL1, APOO, AQP4, ATP13A4, BMP8A, C10orf72, C11orf72, C11orf74, C12orf35, C14orf45, C2orf40, C4A, C4B, C5orf28, C6orf174, C8orf79, CAMK2N1, CCDC121, CCND1, CCND2, CD36, CDH3, CITED1, CLDN1, CLDN16, CLDN4, CLEC2B, CLU, COPZ2, CPNE3, CRABP2, CSGALNACT1, CST6, CWH43, CYSLTR2, DAPK2, DCAF17, DIO1, DIRAS3, DLC1, DOCK9, DPH3B, DUOX1, DUOX2, DUOXA1, DUOXA2, DUSP6, ECE1, EIF2B2, EMP2, ERBB2, ERO1LB, ESRRG, FABP4, FAM82B, FAT4, FIBIN, FLJ42258, FN1, FRMD3, GABBR2, GABRB2, GIMAP2, GIMAP7, GPR155, GPR98, GTF3A, GZMA, GZMK, HEY2, HRASLS, ID3, IGF2BP2, IGFBP6, IGSF1, IKZF2, IKZF3, ITGA9, JAK2, KIAA0284, KIAA0408, KIAA1217, KIAA1305, KIT, KLRC4, KLRK1, KRT19, LGALS3, LIFR, LIPH, LMO3, LOC100131490, LOC100131993, LRP1B, LRP2, MACC1, MAFG, MAPK6, MATN2, MDK, MET, METTL7B, MFGE8, MGC2889, MIS12, MPPED2, MPZL2, MT1F, MT1G, MT1H, MT1P2, MTIF3, NAV2, NCAM1, NELL2, NFATC3, NIPSNAP3A, NIPSNAP3B, NOD1, NRCAM, NTAN1, NUCB2, NUPR1, ODZ1, PCMTD1, PDE5A, PIGN, PKHD1L1, PLA2R1, PLAG1, PLSCR4, PRINS, PSAT1, PXDNL, QTRT1, RAG2, RCBTB2, RG9MTD2, ROS1, RPS6KA6, RXRG, SALL1, SCG5, SDC4, SERPINA1, SERPINA2, SLC26A4, SLC34A2, SLC35D2, SLC7A11, SMAD9, SPRY1, ST3GAL5, STEAP2, STK32A, TASP1, TCF7L2, TCFL5, TIMP1, TMEM171, TMEM215, TNFAIP8, TNFRSF11B, TNFRSF12A, TNFSF10, TPO, TRPC5, TXNL1, UCHL5, WDR72, YIPF1, ZCCHC12, ZCCHC16, ZMAT4, ZYG11A.

(v) List 8: PTC Subtype, n=604

TCID-3153400, TCID-3749600, ABCC3, ABTB2, ACBD7, ACSL1, ACTA2, ADAMTS5, ADAMTS9, ADK, AGR2, AHNAK2, AHR, AIDA, AK1, ALAS2, ALDH1A3, ALOX5, AMIGO2, AMOT, ANK2, ANXA1, ANXA2, ANXA2P1, ANXA3, AOAH, AP3S1, APOL1, AQP9, ARHGAP24, ARL13B, ARL4A, ARMCX3, ARMCX6, ARNTL, ASAP2, ATIC, ATP13A4, ATP13A4, B3GNT3, BCL9, BHLHE40, BHLHE41, BMP8A, BTBD11, BTG3, C11orf72, C11orf80, C12orf49, C16orf45, C19orf33, C1orf115, C1orf116, C2, C2orf40, C3, C4A, C4B, C4orf34, C6orf168, C6orf174, C7orf62, C8orf4, C8orf79, CA11, CADM1, CAMK2N1, CAND1, CAV1, CAV2, CCDC109B, CCDC121, CCDC148, CCDC80, CCL13, CCND1, CCND2, CD151, CD200, CD36, CDCP1, CDH11, CDH3, CDH6, CDK2, CDKL2, CDO1, CDON, CDR1, CFB, CFH, CFHR1, CFI, CHAF1B, CHI3L1, CITED1, CKS2, CLC, CLDN1, CLDN10, CLDN16, CLDN4, CLDN7, CLEC4E, CLU, CNN3, COL1A1, CP, CRABP1, CRABP2, CSF3R, CST6, CTNNAL1, CTNNB1, CTSC, CTSH, CTTN, CXCL1, CXCL14, CXCL17, CXCL2, CXCL3, CXorf18, CXorf27, CYP1B1, CYSLTR2, DAPK2, DCBLD2, DCUN1D3, DDAH1, DDB2, DDX52, DGKH, DGKI, DHRS1, DHRS3, DIO1, DIRAS3, DLC1, DOCK9, DPP4, DPYSL3, DSG2, DSP, DST, DUSP4, DUSP5, DUSP6, DZIP1, ECE1, EDNRB, EGFR, EHBP1, EHD2, EHF, ELK3, ELMO1, EMP2, EMR3, ENAH, ENDOD1, EPB41, EPHA4, EPHX4, EPS8, ERBB3, ERI2, ERP27, ESRRG, ETNK2, ETV1, ETV5, F2RL2, FAAH2, FABP4, FAM111A, FAM111B, FAM164A, FAM176A, FAM20A, FAM55C, FAM84B, FBXO2, FBXO21, FCN1, FCN2, FGF2, FGFR1OP2, FLJ20184, FLJ32810, FLJ42258, FLRT3, FN1, FPR1, FPR2, FRMD3, FZD4, FZD6, FZD7, G0S2, GABBR2, GABRB2, GADD45A, GALE, GALNT12, GALNT3, GALNT7, GBP1, GBP3, GGCT, GLDN, GNG12, GOLT1A, GPAM, GPR110, GPR110, GPR125, GPR98, GPRC5B, GRAMD3, GSN, GYPB, GYPC, GYPE, HEMGN, HEY2, HIGD1A, HIST1H1A, HLA-DQB2, HLF, HMGA2, HPN, HSPH1, ICAM1, IGF2BP2, IGFBP5, IGFBP6, IGSF1, IKZF3, IL1RAP, IL1RL1, IL8RA, IL8RB, IL8RB, IL8RBP, IL8RBP, IMPDH2, INPP5F, IPCEF1, IQGAP2, ITGA2, ITGA3, ITGA9, ITGB1, ITGB6, ITGB8, ITPR1, JUB, KAL1, KATNAL2, KCNK5, KCNQ3, KCTD14, KDELC1, KDELR3, KHDRBS2, KIAA0284, KIAA0408, KIAA1217, KIT, KLF8, KLK10, KLK7, KRT18, KRT19, LAMB3, LAMC1, LAMC2, LCA5,
LCMT1, LCN2, LDOC1, LEMD1, LGALS3, LILRA1, LILRB1, LIMA1, LINGO2, LIPH, LMO3, LMO4, LOC100124692, LOC100127974, LOC100129112, LOC100129115, LOC100129171, LOC100129961, LOC100130248, LOC100131102, LOC100131490, LOC100131938, LOC100132338, LOC100132764, LOC283508, LOC440434, LOC554202, LOC643454, LOC648149, LOC653354, LOC653498, LOC730031, LONRF2, LOX, LPAR5, LPL, LRP1B, LRP2, LRRC69, LRRN1, LUM, LYRM1, MACC1, MAFG, MAMLD1, MAP2, MAPK6, MATN2, MBOAT2, MCM4, MCM7, MDK, MED13, MET, METTL7B, MEX3C, MFGE8, MGAM, MGAT4C, MGST1, MLLT4, MMP16, MMP16, MNDA, MORC4, MPPED2, MPZL2, MRPL14, MT1F, MT1G, MT1H, MT1M, MT1P2, MT1P3, MTHFD1L, MUC1, MUC15, MVP, MXRA5, MYEF2, MYH10, MYO1B, MYO1D, MYO6, NAB2, NAE1, NAG20, NCKAP1, NDFIP2, NEDD4L, NELL2, NEXN, NFE2, NFIB, NFKBIZ, NIPAL3, NOD1, NPC2, NPEPPS, NPY1R, NRCAM, NRIP1, NRP2, NT5E, NUDT6, OCIAD2, OCR1, ODZ1, OSGEP, OSMR, P2RY13, P4HA2, PAM, PARP14, PARP4, PARVA, PBX1, PDE5A, PDE9A, PDGFRL, PDLIM1, PDLIM4, PDZRN4, PEG10, PERP, PHEX, PHF16, PHLDB2, PHYHIP, PKHD1L1, PKP4, PLA2G16, PLA2R1, PLAG1, PLAU, PLCD3, PLEKHA4, PLEKHA5, PLK2, PLP2, PLS3, PLXNC1, PMEPA1, PON2, PPARGC1A, PPBP, PPL, PPP1R14C, PRICKLE1, PRINS, PROK2, PROS1, PRR15, PRRG1, PRSS23, PSD3, PTPN14, PTPRE, PTPRF, PTPRG, PTPRK, PTRF, QTRT1, RAB25, RAB27A, RAB34, RAD23B, RAG2, RAI2, RAPGEF5, RARG, RASA1, RASD2, RBBP7, RBBP8, RBMS2, RCE1, RDH5, RGS18, RGS2, RHOU, RND3, ROS1, RPL39L, RPRD1A, RPS6KA6, RRAS, RUNX1, RUNX2, RXRG, S100A12, S100A14, S100A16, S100A8, S100A9, SALL1, SAV1, SC4MOL, SCARA3, SCARNA11, SCEL, SCG5, SCNN1A, SCRN1, SDC4, SEH1L, SEL1L3, SELL, SEMA3D, SEPT11, SERINC2, SERPINA1, SERPINA2, SERPINE2, SERPING1, SFN, SFTPB, SGCB, SGCE, SGEF, SGMS2, SH2D4A, SH3PXD2A, SIRPA, SIRPB1, SLA, SLC12A2, SLC16A4, SLC17A5, SLC24A5, SLC26A4, SLC26A7, SLC27A2, SLC27A6, SLC34A2, SLC35F2, SLC39A6, SLC4A4, SLC5A8, SLC7A2, SLIT2, SLPI, SMOC2, SMURF2, SNCA, SNX1, SNX22, SNX7, SORBS2, SPATS2, SPATS2L, SPINT1, SPRED2, SPRY1, SPRY2, SRL, SSPN, ST3GAL5, STK32A, SULF1, SYNE1, SYT14, SYTL5, TACSTD2, TBC1D3F,
TDRKH,TEAD1, TEAD1, TFCP2L1, TFF3, TGFA, TGFB2, TGFBR1, TIMP1, TIPARP, TJP1,TJP2, TLCD1, TLR8, TM4SF1, TM4SF4, TM7SF4, TMEM100, TMEM117, TMEM133,TMEM163, TMEM215, TMEM90A, TMEM98, TMPRSS4, TMSB10, TNC, TNFRSF12A,TNFSF15, TOMM34, TPD52L1, TPO, TRIP10, TRPC5, TSC22D1, TSPAN13, TSPAN6,TUBB1, TUBB6, TULP3, TUSC3, TYMS, VNN2, VNN3, WDR40A, WDR54, WNT5A,XKRX, XPR1, YIF1B, YTHDC2, ZBTB33, ZCCHC12, ZCCHC16, ZFP36L1, ZMAT3,ZMAT4, ZNF643, ZNF804B.

(vi) List 9: NHP Subtype, n=653

TCID-3153400, TCID-3749600, ABTB2, ACBD7, ACSL1, ACTA2, ADAMTS5,ADAMTS9, ADK, AGR2, AHNAK2, AHR, AIDA, AK1, AKR1C3, ALAS2, ALDH1A3,AMIGO2, AMOT, ANK2, ANO5, ANXA1, ANXA3, ANXA6, AOAH, AP3S1, APOO, AQP4,AQP9, ARHGAP24, ARL13B, ARL4A, ARMCX3, ARMCX6, ARNTL, ARSG, ASAP2, ATIC,ATP13A4, ATP6V0D2, B3GNT3, BCL9, BHLHE40, BHLHE41, BMP8A, BTBD11, BTG3,C10orf72, C11orf72, C11orf74, C11orf80, C12orf49, C16orf45, C19orf33,C1orf115, C1orf116, C2, C22orf9, C2orf40, C3, C4A, C4B, C4orf34,C5orf28, C6orf168, C6orf174, C7orf62, C8orf4, C8orf79, C9orf68, CA11,CADM1, CALCA, CAMK2N1, CAND1, CASC5, CAV1, CAV2, CCDC121, CCDC148,CCDC80, CCL13, CCND1, CCND1, CCND2, CD151, CD200, CD36, CDCP1, CDH11,CDH3, CDH6, CDK2, CDKL2, CDO1, CDON, CDR1, CEP55, CFB, CFH, CFHR1, CFI,CHAF1B, CHD4, CITED1, CKS2, CLC, CLDN1, CLDN10, CLDN16, CLDN4, CLDN7,CLEC4E, CLU, CNN3, COL1A1, COPZ2, CP, CPE, CRABP1, CRABP2, CSF3R, CST6,CTNNAL1, CTNNB1, CTSH, CTTN, CWH43, CXCL1, CXCL14, CXCL17, CXCL2, CXCL3,CXorf18, CXorf27, CYP24A1, CYP27A1, CYSLTR2, DAPK2, DCAF17, DCBLD2,DCUN1D3, DDAH1, DDB2, DDX52, DGKH, DGKI, DHRS1, DHRS3, DIO1, DIRAS3,DLC1, DLGAP5, DOCK9, DPP4, DPYSL3, DSG2, DSP, DST, DUOX1, DUOX2, DUOXA1,DUOXA2, DUSP4, DUSP5, DUSP6, DZIP1, ECE1, EDNRB, EGFR, EHBP1, EHD2, EHF,ELK3, ELMO1, EMP2, EMR3, ENAH, ENDOD1, EPB41, EPHA4, EPHX4, EPS8, ERBB3,ERI2, ERP27, ESRRG, ETNK2, ETV1, ETV5, F2RL2, FAAH2, FABP4, FAM111A,FAM111B, FAM164A, FAM176A, FAM20A, FAM84B, FAT4, FBXO2, FBXO21, FCN1,FCN2, FGF2, FGFR1OP2, FLJ20184, FLJ32810, FLJ42258, FLJ42258, FLRT3,FN1, FPR1, FPR2, FREM2, FRMD3, FXYD6, FZD4, FZD6, FZD7, G0S2, GABBR2,GABRB2, GADD45A, GALE, GALNT12, GALNT3, GALNT7, GBE1, GBP1, GBP3, GGCT,GLA, GLDN, GNG12, GOLT1A, GPR110, GPR110, GPR125, GPR98, GPRC5B, GRAMD3,GSN, GYPB, GYPC, GYPE, HEMGN, HEY2, HIST1H1A, HLA-DQB2, HMGA2, HMMR,HPN, HSD17B6, HSPH1, ICAM1, IGFBP5, IGFBP6, IGSF1, IKZF2, IL1RL1, IL2RA,IL8, IL8RA, IL8RB, IL8RB, IL8RBP, IL8RBP, IMPDH2, INPP5F, IPCEF1,IQGAP2, ITGA2, ITGA3, ITGA9, ITGB1, ITGB6, ITGB8, ITPR1, JUB, 
KAL1, KCNK5, KCNQ3, KCTD14, KDELC1, KDELR3, KHDRBS2, KIAA0284, KIAA0408, KIAA1217, KIF11, KIT, KLF8, KLK10, KLK7, KRT18, KRT19, LAMB3, LAMC1, LAMC2, LCA5, LCMT1, LCN2, LDOC1, LEMD1, LGALS3, LILRA1, LILRB1, LIMA1, LINGO2, LIPH, LMO3, LMO4, LOC100124692, LOC100127974, LOC100129112, LOC100129115, LOC100129171, LOC100129961, LOC100130248, LOC100131102, LOC100131490, LOC100131938, LOC100131993, LOC100132338, LOC100132764, LOC283508, LOC440434, LOC554202, LOC643454, LOC648149, LOC653354, LOC653498, LOC730031, LONRF2, LOX, LPAR1, LPAR5, LPL, LRP1B, LRP2, LRRC69, LRRN1, LUM, LYRM1, MACC1, MAFG, MAMLD1, MAP2, MAPK6, MATN2, MBOAT2, MCM4, MCM7, MDK, ME1, MED13, MELK, MET, METTL7B, MEX3C, MFGE8, MGAM, MGAT1, MGAT4C, MGST1, MKI67, MLLT4, MMP16, MMP16, MNDA, MORC4, MPPED2, MPZL2, MRPL14, MT1F, MT1G, MT1H, MT1M, MT1P2, MT1P3, MTHFD1L, MUC1, MUC15, MVP, MXRA5, MYEF2, MYH10, MYO1B, MYO1D, MYO5A, MYO6, NAB2, NAE1, NAG20, NAV2, NCKAP1, NDC80, NDFIP2, NEDD4L, NELL2, NEXN, NFE2, NFIB, NIPAL3, NOD1, NPC2, NPEPPS, NPL, NPY1R, NRCAM, NRIP1, NRP2, NT5E, NUCB2, NUDT6, NUSAP1, OCIAD2, OCR1, ODZ1, ORAOV1, OSBPL1A, OSGEP, OSMR, P2RY13, P4HA2, PAM, PAPSS2, PARP4, PARVA, PBX1, PDE5A, PDE9A, PDGFRL, PDLIM1, PDLIM4, PDZRN4, PEG10, PERP, PGCP, PHEX, PHF16, PHLDB2, PHYHIP, PKHD1L1, PKP4, PLA2G16, PLA2G7, PLA2R1, PLAG1, PLAU, PLCD3, PLCL1, PLEKHA4, PLEKHA5, PLK2, PLS3, PLSCR4, PMEPA1, PON2, PPARGC1A, PPBP, PPL, PPP1R14C, PRCP, PRICKLE1, PRINS, PROK2, PROS1, PRR15, PRRG1, PRSS23, PSD3, PSD3, PTPN14, PTPRE, PTPRF, PTPRG, PTPRK, PTRF, QTRT1, RAB25, RAB27A, RAB32, RAB34, RAD23B, RAG2, RAI2, RAPGEF5, RARG, RASA1, RASD2, RBBP7, RBBP8, RBMS2, RCBTB2, RCE1, RDH5, RGS18, RGS2, RHOU, RND3, ROS1, RPL39L, RPRD1A, RPS6KA6, RRAS, RXRG, S100A12, S100A14, S100A16, S100A8, S100A9, SALL1, SAV1, SC4MOL, SCARA3, SCARNA11, SCEL, SCG5, SCNN1A, SCRN1, SDC4, SEH1L, SELL, SEMA3C, SEMA3D, SEPT11, SERINC2, SERPINA1, SERPINA2, SERPINE2, SERPING1, SFN, SFTPB, SGCB, SGCE, SGEF, SGMS2, SH2D4A, SH3PXD2A, SIRPA, SIRPB1, SLA, SLC12A2, SLC16A4, SLC16A6, SLC17A5,
SLC24A5, SLC26A4, SLC26A7, SLC27A2, SLC27A6, SLC34A2, SLC35F2,SLC39A6, SLC4A4, SLC5A8, SLC7A11, SLC7A2, SLIT2, SLPI, SMOC2, SMURF2,SNCA, SNX1, SNX22, SNX7, SOAT1, SORBS2, SPATS2, SPATS2L, SPINT1, SPRED2,SPRY1, SPRY2, SRL, SSPN, ST3GAL5, STK32A, STXBP6, SULF1, SYNE1, SYT14,SYTL5, TACSTD2, TBC1D3F, TDRKH, TEAD1, TEAD1, TFCP2L1, TFF3, TFPI, TGFA,TGFB2, TGFBR1, TIMP1, TIPARP, TJP1, TJP2, TLCD1, TLR8, TM4SF1, TM4SF4,TM7SF4, TMEM100, TMEM117, TMEM133, TMEM163, TMEM171, TMEM215, TMEM90A,TMEM98, TMPRSS4, TNC, TNFRSF12A, TNFSF15, TOMM34, TPD52L1, TPO, TPX2,TRIP10, TRPC5, TSC22D1, TSPAN13, TSPAN6, TUBB1, TUBB6, TULP3, TUSC3,TXNRD1, TYMS, UCHL5, VNN1, VNN2, VNN3, WDR40A, WDR54, WIPI1, WNT5A,XKRX, XPR1, YIF1B, YTHDC2, ZBTB33, ZCCHC12, ZCCHC16, ZFP36L1, ZMAT3,ZMAT4, ZNF643, ZNF804B, ZYG11A.

(vii) List 10: MTC Subtype, n=48

ANXA3, ATP13A4, BLNK, C10orf131, C6orf174, C8orf79, CALCA, CHGB, CP, CPE, DSG2, FREM2, GPR98, IGJ, IYD, KIAA0408, LOC100129171, LPCAT2, LRRC69, MACC1, MAPK6, MGAT4C, MGST1, MMP16, MT1G, MT1H, MT1M, MT1P2, MT1P3, MUC15, MYEF2, NT5E, PKHD1L1, PLS3, RBMS2, RIMS2, SCG3, SEMA3D, SLA, SLC24A5, SMOC2, SULF1, TOX, TSHZ2, TSPAN6, WDR72, ZFP36L1, ZNF208.

(viii) List 11: HC Subtype, n=65

AIM2, APOBEC3F, APOBEC3G, ARHGAP19, BAG3, BCL2A1, BMP8A, C9orf68, CARD17, CARD8, CASP1, CD3D, CD96, CEP110, CLEC2B, CNN2, CPE, CYTH1, DENND4A, DNAJB14, DOCK8, DPYD, DUOX1, DUOX2, DYNC1I2, EGF, EPDR1, ETS1, GBP5, GIMAP2, GIMAP5, GIMAP7, GPR174, GZMK, HNRNPM, HSD17B6, IFI16, IFNAR2, IKZF3, IL7R, ITM2A, JAK2, KCNAB1, KHDRBS2, KLRC4, KLRG1, KLRK1, KYNU, LOC646358, MED13L, ND1, NFATC3, PAPSS2, PGCP, PTPRC, PYHIN1, SLIT1, SP140, SP140L, ST20, STAT4, TC2N, TLE4, ZEB2, ZNF143.

(ix) List 12: HA Subtype, n=24

BCL2, CADM1, CAV1, CRABP1, CTNNB1, CYTH1, DIRAS3, IFITM1, IGFBP5, IGFBP6, LOX, MAP2, MATN2, MET, MKI67, MYO1B, ND1, NUCB2, SCG5, SCNN1A, SEL1L3, SGCE, TNFSF10, TRPC6.

(x) List 13: ATC Subtype, n=12

CASC5, CEP55, COL12A1, DLGAP5, HMMR, KIF11, MELK, MKI67, NDC80, NUSAP1, PYGL, TPX2.
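The subtype lists overlap, so a single gene can be a marker for more than one subtype. The following Python sketch builds a gene-to-subtype index from the two shortest lists above (List 12, HA, and List 13, ATC) and reports the genes they share. The function and variable names are hypothetical, and the sketch only illustrates the bookkeeping; it is not a classification method described in this specification.

```python
from collections import defaultdict

# Gene symbols copied from List 12 (HA subtype) and List 13 (ATC subtype).
subtype_lists = {
    "HA": [
        "BCL2", "CADM1", "CAV1", "CRABP1", "CTNNB1", "CYTH1", "DIRAS3",
        "IFITM1", "IGFBP5", "IGFBP6", "LOX", "MAP2", "MATN2", "MET", "MKI67",
        "MYO1B", "ND1", "NUCB2", "SCG5", "SCNN1A", "SEL1L3", "SGCE",
        "TNFSF10", "TRPC6",
    ],
    "ATC": [
        "CASC5", "CEP55", "COL12A1", "DLGAP5", "HMMR", "KIF11", "MELK",
        "MKI67", "NDC80", "NUSAP1", "PYGL", "TPX2",
    ],
}


def subtypes_by_gene(lists: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each gene symbol to the set of subtype lists that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for subtype, genes in lists.items():
        for gene in genes:
            index[gene].add(subtype)
    return dict(index)


index = subtypes_by_gene(subtype_lists)
shared = sorted(gene for gene, subtypes in index.items() if len(subtypes) > 1)

print(len(index), shared)
```

For these two lists the index holds 35 unique genes, and MKI67 is the only gene present in both.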

Dominant gene ontologies of the top 948 thyroid biomarkers are listed below:

List 14: Angiogenesis, n=23

ACTA2, ANXA2, ARHGAP24, CALCA, CAV1, CITED1, COL1A1, CXCL17, EGF, ELK3, IL8, LOX, PLCD3, PROK2, RASA1, SEMA3C, TCF7L2, TGFA, TGFB2, TIPARP, TNFRSF12A, ZFP36L1, ZFPM2.

List 15: Apoptosis, n=43

AHR, ANXA1, BAG3, BCL2, BCL2A1, BIRC5, C8orf4, CADM1, CD2, CLU, CTNNB1, DAPK2, DLC1, DNASE1L3, ECE1, ELMO1, FAM176A, FGF2, GADD45A, GULP1, GZMA, HIPK2, IL2RA, IL8RB, JAK2, NCKAP1, NOD1, NUPR1, PEG10, PERP, PROK2, RYR2, SLC5A8, STK17B, SULF1, TCF7L2, TGFB2, TNFAIP8, TNFRSF11B, TNFRSF12A, TNFSF10, VNN1, ZMAT3.

List 16: Cell Cycle, Transcription Factors, n=184

AEBP1, AHR, AK1, ANXA1, APOBEC3F, APOBEC3G, ARHGAP24, ARNTL, ATM, BCL2,BHLHE40, BHLHE41, BIRC5, BMP1, BMP8A, CADM1, CAND1, CARD8, CASP1, CCND1,CCND2, CDK2, CEP110, CEP55, CHAF1B, CHD4, CITED1, CKS2, CLU, CRABP2,CSGALNACT1, CTNNB1, CXCL1, CXCL17, DENND4A, DLGAP5, DST, DZIP1, EGF,EHF, EIF2B2, EIF4H, ELK3, EMP2, EPS8, ERBB2, ERBB3, ESRRG, ETS1, ETV1,ETV4, ETV5, FABP4, FGF2, G0S2, GADD45A, GLDN, GLIS3, GTF3A, HEMGN, HEY2,HIPK2, HLF, HMGA2, HPN, ID3, IFI16, IFNAR2, IGSF1, IKZF2, IKZF3, IKZF4,IL2RA, IL8, ITPR1, JAK2, JUB, KHDRBS2, KIF11, KLF8, KLK10, KRT18,LGALS3, LIFR, LMO3, LMO4, LRP2, LTBP2, LTBP3, MACC1, MAFG, MAMLD1,MAPK4, MAPK6, MCM4, MCM7, MDK, MED13, MED13L, MIS12, MKI67, MLLT3, MNDA,MTIF3, MYH10, NAB2, NAE1, NDC80, NFATC3, NFE2, NFIB, NFKBIZ, NOD1,NPAS3, NPAT, NRIP1, NRP2, NUDT6, NUPR1, NUSAP1, OSMR, PARD6B, PARP14,PARP4, PBX1, PDLIM1, PEG10, PIAS3, PLAG1, POU2F3, PPARGC1A, PPBP, PRMT6,PROK2, PTRF, PYHIN1, RARG, RBBP7, RBBP8, RGS2, RHOH, RRM2, RUNX1, RUNX2,RXRG, SALL1, SEMA3D, SERPINE2, SLIT1, SLIT2, SMAD9, SMURF2, SP140,SPC25, SPOCK1, STAT4, SYNE1, TACSTD2, TCF7L2, TCFL5, TEAD1, TFCP2L1,TGFA, TGFB2, TGFBR1, TLE4, TNFAIP8, TNFRSF12A, TNFRSF17, TPX2, TSC22D1,TSHZ2, TULP3, TYMS, WNT5A, ZBTB33, ZCCHC12, ZEB2, ZFP36L1, ZFPM2,ZNF143, ZNF208, ZNF487, ZNF643.

List 17: Cell Membrane, n=410

ABCC3, ABCD2, ACSL1, ADAMTS5, ADAMTS9, ADORA1, AFAP1, AK1, ALOX5, AMIGO2, ANK2, ANO5, AP3S1, APOL1, APOO, AQP4, AQP9, ARMCX3, ARMCX6, ASAP2, ATP13A4, ATP6V0D2, ATP8A1, AVPR1A, B3GNT3, BCL2, BLNK, BTBD11, C10orf72, C17orf87, C1orf115, C4orf34, C5orf28, C6orf174, CADM1, CAMK2N1, CAV1, CAV2, CCDC109B, CD151, CD180, CD2, CD200, CD36, CD3D, CD48, CD48, CD52, CD69, CD79A, CD96, CDCP1, CDH11, CDH3, CDH6, CDON, CFB, CFI, CHI3L1, CLDN1, CLDN10, CLDN16, CLDN4, CLDN7, CLEC2B, CLEC4E, COL12A1, COL1A1, COPZ2, CP, CPE, CR2, CSF3R, CSGALNACT1, CTNNAL1, CTNNB1, CWH43, CYP1B1, CYP27A1, CYP4B1, CYSLTR1, CYSLTR2, CYTH1, DCAF17, DCBLD2, DHRS3, DIO1, DIRAS3, DLG2, DLG4, DNAJB14, DOCK9, DPP4, DPYSL3, DSG2, DUOX1, DUOX2, DUOXA1, DUOXA2, ECE1, EDNRB, EFEMP1, EGF, EGFR, EHBP1, EHD2, ELMO1, EMP2, EMR3, ENTPD1, EPB41, EPHA4, EPHX4, ERBB2, ERBB3, ERO1LB, F2RL2, F8, FAAH2, FAM176A, FAM84B, FAT4, FBLN5, FLRT3, FN1, FPR1, FPR2, FREM2, FRMD3, FXYD6, FZD4, FZD6, FZD7, GABBR2, GABRB2, GALNT12, GALNT3, GALNT7, GBP1, GBP3, GBP5, GIMAP2, GIMAP5, GJA4, GLDN, GNG12, GOLT1A, GPAM, GPR110, GPR125, GPR155, GPR174, GPR98, GPRC5B, GYPB, GYPC, GYPE, HIGD1A, HK1, HLA-DPB1, HNRNPM, HPN, HSD17B6, ICAM1, IFITM1, IFNAR2, IGSF1, IL1RAP, IL1RL1, IL2RA, IL7R, IL8RA, IL8RB, IPCEF1, ITGA2, ITGA3, ITGA4, ITGA9, ITGB1, ITGB4, ITGB6, ITGB8, ITM2A, ITPR1, IYD, JAK2, JUB, KAL1, KCNA3, KCNAB1, KCNK5, KCNQ3, KCTD14, KDELR3, KIAA1305, KIT, KLRB1, KLRC4, KLRG1, KLRK1, LAMB1, LAMC1, LEMD1, LGALS3, LIFR, LILRA1, LILRB1, LINGO2, LIPH, LPAR1, LPAR5, LPCAT2, LPL, LRP1B, LRP2, LRRN1, LRRN3, LUM, MATN2, MBOAT2, MET, MFGE8, MGAM, MGAT1, MGAT4C, MGST1, MMP16, MPZL2, MRC2, MUC1, MUC15, MYH10, MYO6, NAE1, NCAM1, NCKAP1, ND1, NDFIP2, NIPAL3, NPY1R, NRCAM, NRP2, NT5E, NUCB2, ODZ1, OSMR, P2RY13, PAM, PARD6B, PARP14, PARVA, PCDH1, PCNXL2, PERP, PHEX, PHLDB2, PIGN, PKHD1L1, PKP2, PLA2G16, PLA2R1, PLAU, PLCD3, PLEK, PLEKHA4, PLP2, PLSCR4, PLXNC1, PMEPA1, PON2, POR, PPAP2C, PPL, PPP1R14C, PRICKLE1, PRRG1, PSD3, PTK7, PTPRC, PTPRE, PTPRF, PTPRG, PTPRK, PTPRU, PTRF, RAB25,
RAB27A, RARG, RASA1, RASD2, RCE1, RDH5, RGS13, RHOH, RHOU,RIMS2, RND3, ROS1, RRAS, RRAS2, RRBP1, RYR2, S100A12, SC4MOL, SCARA3,SCEL, SCNN1A, SDC4, SDK1, SEL1L3, SELL, SEMA3C, SEMA3D, SEMA4C, SERINC2,SERPINA1, SGCB, SGCE, SGMS2, SGPP2, SIRPA, SIRPB1, SLC12A2, SLC16A4,SLC16A6, SLC17A5, SLC24A5, SLC25A33, SLC26A4, SLC26A7, SLC27A2, SLC27A6,SLC34A2, SLC35D2, SLC35F2, SLC39A6, SLC4A4, SLC5A8, SLC7A11, SLC7A2,SMURF2, SNCA, SNX1, SOAT1, SPINT1, SPOCK1, SPRED2, SPRY1, SPRY2, SQLE,SSPN, ST3GAL5, STEAP2, STXBP6, SYNE1, SYT14, SYTL5, TACSTD2, TFCP2L1,TFF3, TFPI, TGFA, TGFB2, TGFBR1, TIMP1, TJP1, TJP2, TLCD1, TLR10, TLR8,TM4SF1, TM4SF4, TM7SF4, TMEM100, TMEM117, TMEM133, TMEM156, TMEM163,TMEM171, TMEM215, TMEM220, TMEM90A, TMEM98, TMPRSS4, TNC, TNFRSF11B,TNFRSF12A, TNFRSF17, TNFSF10, TNFSF15, TOMM34, TPO, TRIP10, TRPC5,TRPC6, TSPAN13, TSPAN6, TSPAN8, TULP3, TUSC3, VAMP1, VNN1, VNN2, VNN3,WNT5A, XKRX, XPR1, YIF1B, YIPF1, ZBTB33.

List 18: Rare Membrane Components, n=55

AMOT, ANXA1, ANXA2, CALCA, CAMK2N1, CAV1, CAV2, CCDC80, CLU, CST6, CTNNB1, CTTN, DLC1, DPP4, DSG2, DSP, DST, ENAH, GJA4, HIPK2, ITGB1, ITGB4, JAK2, JUB, KRT19, LCP1, LRP2, MYH10, MYO5A, MYO6, NEB, PARVA, PCDH1, PERP, PKP2, PKP4, PLEK, PPL, PTRF, RAB34, RASA1, RYR2, SCEL, SGCB, SGCE, SLC27A6, SLIT1, SPRY1, SRL, SSPN, SYNE1, TGFB2, TIAM2, TJP1, TNFRSF12A.

List 19: Cell-Cell Adhesion, n=85

AEBP1, AFAP1, AMIGO2, ARHGAP24, BCL2, CADM1, CALCA, CD151, CD2, CD36, CD96, CDH3, CDH6, CDON, CLDN1, CLDN10, COL12A1, CSF3R, CTNNAL1, CTNNB1, DCBLD2, DLC1, DSG2, DST, EGFR, ENAH, ENTPD1, EPDR1, F8, FAT4, FBLN5, FLRT3, FN1, FPR2, FREM2, GPR98, ICAM1, IGFBP7, IL1RL1, ITGA2, ITGA3, ITGA4, ITGA9, ITGB1, ITGB4, ITGB6, ITGB8, JUB, KAL1, LAMB1, LAMB3, LAMC1, LAMC2, LIMA1, MFGE8, MLLT4, MPZL2, NCAM1, NELL2, NRCAM, NRP2, PARVA, PCDH1, PERP, PKP2, PKP4, PLXNC1, PTK7, PTPRC, PTPRF, PTPRK, PTPRU, RHOU, RND3, SDK1, SELL, SGCE, SIRPA, SPOCK1, SPP1, SSPN, TJP1, TNC, TNFRSF12A, VNN1.

List 20: Apical Cell Membrane, n=15

ANK2, ATP6V0D2, CTNNB1, CTNNB1, DPP4, DUOX1, ERBB2, ERBB3, F2RL2, FZD6, LRP2, SCNN1A, SLC26A4, SLC34A2, TFF3.

List 21: Basolateral, Lateral Cell Membrane, n=28

ANK2, ANXA1, ANXA2, CADM1, CCDC80, CTNNB1, CTTN, DSP, DST, EGFR, EPB41, ERBB2, ERBB3, FREM2, LAMB1, LAMB3, LAMC1, LAMC2, MET, MYH10, MYO6, PTPRK, SLC26A7, SMOC2, SNCA, TIMP3, TJP1, TRIP10.

List 22: Integrins, n=14

ADAMTS5, DST, FBLN5, ICAM1, ITGA2, ITGA3, ITGA4, ITGA9, ITGB1, ITGB4, ITGB6, ITGB8, MFGE8, PLEK.

List 23: Cell Junction, n=40

AMOT, ARHGAP24, ARHGAP24, CADM1, CAMK2N1, CLDN1, CLDN10, CLDN16, CLDN4, CLDN7, CNN2, DLG2, DLG4, DPYSL3, DSP, ENAH, GABBR2, GABRB2, GJA4, JUB, LIMA1, MLLT4, NCKAP1, NEXN, PARD6B, PARVA, PCDH1, PERP, PPL, PSD3, PTPRK, PTPRU, RHOU, RIMS2, SH3PXD2A, SSPN, TGFB2, TJP1, TJP2, VAMP1.

List 24: Cell Surface, n=17

CD36, DCBLD2, DPP4, GPR98, HMMR, IL1RL1, IL8RB, ITGA4, ITGB1, KAL1, MMP16, PTPRK, SDC4, SULF1, TGFA, TM7SF4, TNFRSF12A.

List 25: Extracellular Space, n=156

ADAMTS5, ADAMTS9, AEBP1, AGR2, ANGPTL1, ANXA2, APOL1, APOO, BMP1, BMP8A, C12orf49, C2, C2orf40, C3, C4A, C4B, C4orf7, CA11, CALCA, CCDC80, CCL13, CCL19, CDCP1, CFB, CFH, CFHR1, CFI, CHGB, CHI3L1, CLU, COL12A1, COL1A1, CP, CPE, CSF3R, CST6, CXCL1, CXCL11, CXCL13, CXCL14, CXCL17, CXCL2, CXCL3, CXCL9, DPP4, EFEMP1, EGF, EGFR, EMR3, ENDOD1, EPDR1, ERBB3, F8, FAM20A, FAM55C, FBLN5, FCN1, FCN2, FGF2, FIBIN, FN1, FXYD6, GLA, GSN, GZMA, GZMK, ICAM1, IFNAR2, IGFBP5, IGFBP6, IGFBP7, IGJ, IGKC, IGKV1-5, IGKV3-20, IGKV3D-11, IGSF1, IL1RAP, IL1RL1, IL7R, IL8, KAL1, KIT, KLK10, KLK7, LAMB1, LAMB3, LAMC1, LAMC2, LCN2, LIFR, LIPH, LOC652694, LOX, LPL, LTBP2, LTBP3, LUM, LYZ, MATN2, MDK, MFGE8, MMP16, MUC1, MUC15, MXRA5, NCAM1, NELL2, NPC2, NUCB2, ODZ1, PAM, PDGFRL, PGCP, PLA2G7, PLA2R1, PLAU, PON2, PPBP, PROK2, PROS1, PRRG1, PRSS23, PXDNL, RNASE1, RNASET2, SCG3, SCG5, SEMA3C, SEMA3D, SEPP1, SERPINA1, SERPINE2, SERPING1, SFN, SFTPB, SLIT1, SLIT2, SLPI, SMOC2, SPINT1, SPOCK1, SPP1, SULF1, TFF3, TFPI, TGFA, TGFB2, THSD4, TIMP1, TIMP3, TNC, TNFRSF11B, TNFSF10, TNFSF15, WNT5A.

List 26: Cytoskeleton, n=94

ACTA2, ADORA1, AFAP1, AMOT, ANK2, ANXA2, AP3S1, ARHGAP24, ATM, ATP8A1, BCL2, BIRC5, C2orf40, CASC5, CLU, CNN2, CNN3, COL12A1, COL1A1, COPZ2, CTNNAL1, CTNNB1, CTTN, CXCL1, DLG4, DLGAP5, DPYSL3, DST, DYNC1I2, DYNLT1, EGFR, ELMO1, ENAH, EPB41, EPS8, FAM82B, FRMD3, GPRC5B, GSN, GYPC, IGF2BP2, IQGAP2, JAK2, JUB, KATNAL2, KIAA0284, KIF11, KRT18, LCA5, LCP1, LIMA1, LOX, LUM, MAP2, MPZL2, MYH10, MYO1B, MYO1D, MYO5A, MYO6, NEB, NEXN, NFE2, NUSAP1, PARVA, PDLIM1, PKP2, PLEK, PLS3, PPL, PTPN14, RHOU, RND3, S100A9, SCNN1A, SDC4, SGCB, SGCE, SNCA, SORBS2, SPRED2, SPRY2, STK17B, SYNE1, TGFB2, TGFBR1, TMSB10, TMSB15A, TPX2, TRIP10, TUBB1, TUBB6, VAMP1, WIPI1.

In some embodiments, the present invention provides a method of classifying cancer comprising the steps of: obtaining a biological sample comprising gene expression products; determining the expression level for one or more gene expression products of the biological sample; and identifying the biological sample as cancerous wherein the gene expression level is indicative of the presence of thyroid cancer in the biological sample. This can be done by correlating the gene expression levels with the presence of thyroid cancer in the biological sample. In one embodiment, the gene expression products are selected from one or more genes listed in Table 2. In some embodiments, the method further includes identifying the biological sample as positive for a cancer that has metastasized to thyroid from a non-thyroid organ if there is a difference in the gene expression levels between the biological sample and a control sample at a specified confidence level.

Biomarkers involved in metastasis to thyroid from a non-thyroid organ are provided. Such metastatic cancers that metastasize to thyroid and can be diagnosed using the subject methods of the present invention include but are not limited to metastatic parathyroid cancer, metastatic melanoma, metastatic renal carcinoma, metastatic breast carcinoma, and metastatic B cell lymphoma. Exemplary biomarkers that can be used by the subject methods to diagnose metastasis to thyroid are listed in Table 2.

TABLE 2: Biomarkers involved in metastasis to thyroid

Type of metastasis: Top Biomarkers of Non-thyroid Metastases to the Thyroid
Number of genes: 73
Genes: ACADL, ATP13A4, BIRC5, BTG3, C2orf40, C7orf62, CD24, CHEK1, CP, CRABP1, CXADR, CXADRP2, DIO1, DIO2, EPCAM, EPR1, GPX3, HSD17B6, IQCA1, IYD, KCNJ15, KCNJ16, KRT7, LMO3, LOC100129258, LOC100130518, LPCAT2, LRRC2, LRRC69, MAL2, MAPK6, MGAT4C, MGC9913, MT1F, MT1G, MT1H, MT1P2, MUC15, NEBL, NPNT, NTRK2, PAR1, PCP4, PDE1A, PDE8B, PKHD1L1, PLS3, PVRL2, PVRL3, RGN, RPL3, RRM2, SCD, SEMA3D, SH3BGRL2, SLC26A4, SLC26A7, SNRPN, SPC25, SYT14, TBCKL, TCEAL2, TCEAL4, TG, TPO, TSHR, WDR72, ZBED2, ZNF208, ZNF43, ZNF676, ZNF728, ZNF99

Type of metastasis: Parathyroid Metastasis to Thyroid
Number of genes: 101
Genes: TCID-2688277, ACSL3, ACTR3B, ADAM23, ADH5, ARP11, AS3MT, BANK1, C10orf32, C11orf41, C2orf67, C7orf62, C8orf34, CA8, CASR, CD109, CD226, CD24, CD44, CDCA7L, CHEK1, CLDN1, CP, DIO2, DMRT2, DNAH11, DPP4, ELOVL2, ENPEP, EPHA7, ESRRG, EYA1, FMN2, GCM2, GPR160, GPR64, HSD17B6, ID2, ID2B, IYD, KIDINS220, KIF13B, KL, LGI2, LMO3, LOC100131599, LOC150786, LPL, LRRC69, MAPK6, MGST1, MT1F, MT1G, MT1H, MT1P2, MUC15, NAALADL2, NPNT, OGN, PDE8B, PEX5L, PKHD1L1, PLA2G4A, PLCB1, PRLR, PTH, PTN, PTPRD, PTTG1, PTTG2, PVALB, PVRL2, RAB6A, RAB6C, RAPGEF5, RARRES2, RGN, RNF217, RPE, SACS, SEMA3D, SGK1, SLA, SLC15A1, SLC26A4, SLC26A7, SLC7A8, SPOCK3, ST3GAL5, STXBP5, SYCP2L, TBCKL, TG, TINF2, TMEM167A, TPO, TSHR, TTR, WDR72, YAP1, ZBED2

Type of metastasis: Melanoma Metastasis to Thyroid
Number of genes: 190
Genes: TCID-2840750, ABCB5, AHNAK2, ALX1, ANLN, AP1S2, APOD, ASB11, ATP13A4, ATP1B1, ATRNL1, AZGP1, BACE2, BAMBI, BCHE, BIRC5, BRIP1, BZW1, BZW1L1, C2orf40, C6orf218, C7orf62, CA14, CASC1, CCNB2, CD24, CDH19, CDK2, CDKN3, CENPF, CHRNA5, CP, CRABP1, DCT, DEPDC1, DIO1, DIO2, DLGAP5, DSCC1, DSP, EDNRB, EIF1AY, EIF4A1, ENPP1, EPCAM, EPR1, ESRP1, FABP7, FANCI, GAS2L3, GGH, GPM6B, GPNMB, GPR19, GPX3, GULP1, GYG2, HAS2, HEATR5A, HMCN1, HTN1, IL13RA2, IQCA1, IYD, KCNJ15, KCNJ16, KIAA0894, KIF23, KRT7, KRTAP19-1, LGALS1, LMO3, LOC100129171, LOC100129258, LOC100130275, LOC100130357, LOC100130518, LOC100131821, LOC145694, LOC653653, LRP2, LRRC69, LSAMP, LUM, MAL2, MAPK6, MGC87042, MITF, MLANA, MME, MND1, MOXD1, MSMB, MUC15, NDC80, NEBL, NLGN1, NOX4, NPNT, NTRK2, NUDT10, NUDT11, PAX3, PBK, PCP4, PDE3B, PDE8B, PI15, PIGA, PIR, PKHD1L1, PLP1, PLXNC1, POLG, POMGNT1, POPDC3, POSTN, PRAME, PRAMEL, PTPRZ1, PVRL2, PYGL, QPCT, RGN, RNF128, ROPN1, ROPN1B, RPL3, RPSA, RPSAP15, RPSAP58, S100B, SACS, SAMD12, SCD, SEMA3C, SERPINA3, SERPINE2, SERPINF1, SHC4, SILV, SLA, SLC16A1, SLC26A4, SLC26A7, SLC39A6, SLC45A2, SLC5A8, SLC6A15, SNAI2, SNCA, SNORA48, SNORA67, SORBS1, SPC25, SPP1, SPRY2, SRPX, ST3GAL6, STEAP1, STK33, TBC1D7, TBCKL, TCEAL2, TCEAL4, TCN1, TF, TFAP2A, TG, TIMP2, TMSB15A, TMSB15B, TNFRSF11B, TOP2A, TPO, TPX2, TRPM1, TSHR, TSPAN1, TUBB4, TYR, TYRL, TYRP1, WDR72, ZBED2, ZNF208, ZNF43, ZNF676, ZNF728, ZNF99

Type of metastasis: Renal Carcinoma Metastasis to Thyroid
Number of genes: 130
Genes: TCID-2763154, ADFP, AKR1C3, ALPK2, APOL1, ASPA, ATP13A4, ATP8A1, BHMT, BHMT2, BICC1, BIRC3, C12orf75, C1S, C2orf40, C3, C7orf62, CA12, CDH6, CLRN3, CP, CYB5A, DAB2, DEFB1, DIO2, EFNA5, EGLN3, EIF1AY, ENPEP, ENPP1, ENPP3, EPCAM, ESRP1, FABP6, FABP7, FAM133B, FCGR3A, FCGR3B, FXYD2, GAS2L3, GLYAT, GSTA1, GSTA2, GSTA5, HAVCR1, HLA-DQA1, HPS3, IGFBP3, IL20RB, IYD, KMO, LEPREL1, LMO3, LOC100101266, LOC100129233, LOC100129518, LOC100130232, LOC100130518, LOC100133763, LOC728640, LOX, LRRC69, MAPK6, MGC9913, MME, MMP7, MT1G, MUC15, NEBL, NLGN1, NNMT, NPNT, NR1H4, OPN3, OSMR, PCOLCE2, PCP4, PDE8B, PDZK1IP1, PIGA, PKHD1L1, POSTN, PREPL, PTHLH, RPS6KA6, S100A10, SAA1, SAA2, SCD, SLC16A1, SLC16A4, SLC17A3, SLC26A4, SLC26A7, SLC3A1, SLCO4C1, SNX10, SOD2, SPINK1, SPP1, SYT14, TBCKL, TCEAL2, TCEAL4, TG, TMEM161B, TMEM176A, TMEM45A, TNFAIP6, TNFSF10, TPO, TSHR, UGT1A1, UGT1A10, UGT1A3, UGT1A4, UGT1A5, UGT1A6, UGT1A7, UGT1A8, UGT1A9, UGT2A3, UGT2B7, VCAM1, VCAN, ZNF208, ZNF43, ZNF676, ZNF728, ZNF99

Type of metastasis: Breast Carcinoma Metastasis to Thyroid
Number of genes: 117
Genes: TCID-3777770, ACADL, AGR2, AGR3, ALDH1A1, ANLN, ASPM, ATP13A4, AZGP1, BIRC5, BRIP1, C10orf81, C7orf62, C8orf79, CA2, CCNB2, CCNE2, CDC2, CDC6, CDKN3, CENPF, CHEK1, CP, CSNK1G1, DEPDC1, DIO1, DIO2, DLGAP5, DTL, EHF, EPR1, EZH2, FAM111B, FANCI, GALNT5, GPX3, HHEX, HPS3, IQCA1, ITGB3, IYD, KCNJ15, KCNJ16, KIAA0101, KIF23, LMO3, LOC100129258, LOC100130518, LOC100131821, LOC145694, LRP2, LRRC2, LRRC69, MAPK6, MELK, MGAT4C, MKI67, MND1, MUC15, MYB, NDC80, NPY1R, NUF2, NUSAP1, PAR1, PARP8, PBK, PCP4, PDE1A, PDE8B, PI15, PIP, PKHD1L1, POLG, PPARGC1A, PRC1, PVRL2, PVRL3, RAD51AP1, RGN, RPL3, RRM2, SAA1, SAA2, SCD, SCGB1D2, SCGB2A2, SEMA3C, SERPINA3, SLA, SLC26A4, SLC26A7, SNRPN, SPC25, ST3GAL5, STK33, SULT1C2, SYCP2, SYT14, TFF1, TG, THBS1, TOP2A, TPO, TPX2, TRPS1, TSHR, TTK, UNQ353, VTCN1, WDR72, ZBED2, ZNF208, ZNF43, ZNF676, ZNF728, ZNF99

Type of metastasis: B cell Lymphoma Metastasis to Thyroid
Number of genes: 160
Genes: ACADL, AIM2, ALDH1A1, ALG9, APP, ARHGAP29, ATP13A4, ATP1B1, BCL2A1, BIRC3, BIRC5, BTG3, C11orf74, C2orf40, C7orf62, CALCRL, CALD1, CD180, CD24, CD48, CD52, CD53, CDH1, CNN3, COX11, CP, CPE, CR2, CRYAB, CXADR, CXADRP2, CXorf65, DCBLD2, DIO2, DLGAP5, DSP, EAF2, EFCAB2, ENPP1, EPCAM, EPR1, ESRP1, FABP4, FDXACB1, FNBP1L, GJA1, GNAI1, GNG12, GPR174, GPX3, GTSF1, HCG11, IKZF3, IL2RG, IQCA1, IYD, KCNJ16, KLHL6, LAPTM5, LCP1, LIFR, LMO3, LOC100128219, LOC100129258, LOC100130518, LOC100131821, LOC100131938, LOC647979, LOC729828, LPCAT2, LPHN2, LRIG3, LRMP, LRP2, LRRC6, LRRC69, MAL2, MAOA, MAPK6, MATN2, MCOLN2, MGC9913, MGP, MKI67, MS4A1, MT1F, MT1G, MT1H, MT1L, MT1P2, MUC15, NCKAP1, NCKAP1L, NEBL, NME5, NPNT, NUDT12, PAR1, PBX1, PCP4, PDE8B, PDK4, PERP, PFN2, PKHD1L1, PLOD2, PLS3, POMGNT1, PPARGC1A, PPIC, PTPRC, PTPRM, PVRL3, RASEF, RGN, RGS13, RGS5, RHOH, RPL3, RPL37AP8, RRM2, S100A1, S100A13, SDC2, SELL, SEMA3D, SH3BGRL2, SLC26A4, SLC26A7, SMARCA1, SNRPN, SP140, SP140L, SPARCL1, SPC25, SPTLC3, ST20, STK17B, SYT14, TBCKL, TCEAL2, TCEAL4, TEAD1, TG, TJP1, TLR10, TOM1L1, TOP2A, TSHR, TSPAN1, TSPAN6, UACA, VNN2, WBP5, WDR72, ZNF208, ZNF43, ZNF676, ZNF728, ZNF99

(viii) Classification Error Rates

In some embodiments, top thyroid biomarkers (948 genes) are subdivided into bins (50 TCIDs per bin) to demonstrate the minimum number of genes required to achieve an overall classification error rate of less than 4% (FIG. 1). The original TCIDs used for classification correspond to the Affymetrix Human Exon 1.0 ST microarray chip and each may map to more than one gene or no genes at all (Affymetrix annotation file: HuEx-1_0-st-v2.na29.hg18.transcript.csv). When no genes map to a TCID, the biomarker is denoted as TCID-######.
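As an illustration only, the binning and labeling convention described above might be sketched as follows. The function names and the in-memory inputs are hypothetical and are not part of the disclosed method; the bin size of 50 and the TCID-###### label come from the text, and the ranking itself is assumed to have been computed elsewhere.

```python
def bin_ranked_tcids(ranked_tcids, bin_size=50):
    """Split a ranked list of TCIDs into consecutive bins of bin_size.

    The last bin may be shorter when the list length is not a multiple
    of bin_size.
    """
    return [ranked_tcids[i:i + bin_size]
            for i in range(0, len(ranked_tcids), bin_size)]


def marker_label(tcid, gene_symbols):
    """Return the gene symbols mapped to a TCID, or 'TCID-<id>' if none map.

    gene_symbols is a (possibly empty) collection of symbols taken from an
    annotation file; a TCID may map to several genes or to none.
    """
    if gene_symbols:
        return ", ".join(sorted(gene_symbols))
    return f"TCID-{tcid}"


# Hypothetical usage: 948 ranked identifiers give 18 full bins and one
# final bin of 48.
bins = bin_ranked_tcids(list(range(1, 949)))
```

Because one TCID can map to several genes, the number of gene symbols in a bin can exceed 50, which is consistent with the n values given for the lists that follow.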

List 27: Error Rate Bin 1 (TCID 1-50 (n=50), Gene Symbols, n=58)

AMIGO2, C11orf72, C11orf80, C6orf174, CAMK2N1, CDH3, CITED1, CLDN1, CLDN16, CST6, CXorf27, DLC1, EMP2, ERBB3, FZD4, GABRB2, GOLT1A, HEY2, HMGA2, IGFBP6, ITGA2, KCNQ3, KIAA0408, KRT19, LIPH, LOC100129115, MACC1, MDK, MET, METTL7B, MFGE8, MPZL2, NAB2, NOD1, NRCAM, PDE5A, PDLIM4, PHYHIP, PLAG1, PLCD3, PRICKLE1, PROS1, PRR15, PRSS23, PTPRF, QTRT1, RCE1, RDH5, ROS1, RXRG, SDC4, SLC27A6, SLC34A2, SYTL5, TNFRSF12A, TRPC5, TUSC3, ZCCHC12.

List 28: Error Rate Bin 2 (TCID 51-100 (n=50), Gene Symbols, n=59)

AHNAK2, AIDA, AMOT, ARMCX3, BCL9, C1orf115, C1orf116, C4A, C4B, C6orf168, CCDC121, CCND1, CDH6, CFI, CLDN10, CLU, CRABP2, CXCL14, DOCK9, DZIP1, EDNRB, EHD2, ENDOD1, EPHA4, EPS8, ETNK2, FAM176A, FLJ42258, HPN, ITGA3, ITGB8, KCNK5, KLK10, LAMB3, LEMD1, LOC100129112, LOC100132338, LOC554202, MAFG, MAMLD1, MED13, MYH10, NELL2, PCNXL2, PDE9A, PLEKHA4, RAB34, RARG, SCG5, SFTPB, SLC35F2, SLIT2, TACSTD2, TGFA, TIMP1, TMEM100, TMPRSS4, TNC, ZCCHC16.

List 29: Error Rate Bin 3 (TCID 101-150 (n=50), Gene Symbols, n=52)

ABTB2, ADAMTS9, ADORA1, B3GNT3, BMP1, C19orf33, C3, CDH11, CLIP3, COL1A1, CXCL17, CYSLTR2, DAPK2, DHRS3, DIRAS3, DPYSL3, DUSP4, ECE1, FBXO2, FGF2, FN1, GALE, GPRC5B, GSN, IKZF4, IQGAP2, ITGB4, KIAA0284, KLF8, KLK7, LONRF2, LPAR5, MPPED2, MUC1, NRIP1, NUDT6, ODZ1, PAM, POU2F3, PPL, PTRF, RAPGEF5, RASD2, SCARA3, SCEL, SEMA4C, SNX22, SPRY1, SSPN, TM4SF4, XPR1, YIF1B.

List 30: Error Rate Bin 4 (TCID 151-200 (n=50), Gene Symbols, n=58)

AFAP1, ARMCX6, ARNTL, ASAP2, C2, C8orf4, CCDC148, CFB, CHAF1B, CLDN4, DLG4, DUSP6, ELMO1, FAAH2, FAM20A, FLRT3, FRMD3, GALNT12, GALNT7, IGFBP5, IKZF2, ISYNA1, LOC100131490, LOC648149, LOC653354, LRP1B, MAP2, MRC2, MT1F, MT1G, MT1H, MT1P2, MYEF2, NPAS3, PARD6B, PCDH1, PMEPA1, PPAP2C, PSD3, PTPRK, PTPRU, RAI2, RRAS, SDK1, SERPINA1, SERPINA2, SGMS2, SLC24A5, SMURF2, SPATS2L, SPINT1, TDRKH, TIPARP, TM4SF1, TMEM98, WNT5A, XKRX, ZMAT4.

List 31: Error Rate Bin 5 (TCID 201-250 (n=50), Gene Symbols, n=53)

ABCC3, AEBP1, C16orf45, C19orf33, CA11, CCND2, CDO1, CYP4B1, DOK4, DUSP5, ETV4, FAM111A, FN1, GABBR2, GGCT, GJA4, GPR110, HIPK2, ITGA9, JUB, KDELR3, KIAA1217, LAMC2, LCA5, LTBP2, LTBP3, MAPK6, NAV2, NIPAL3, OSMR, PDZRN4, PHLDB2, PIAS3, PKHD1L1, PKP2, PKP4, PRINS, PTK7, PTPRG, RAB27A, RAD23B, RASA1, RICH2, SCRN1, SFN, ST3GAL5, STK32A, TCERG1L, THSD4, TJP2, TM7SF4, TPO, YIF1B.

IX. Compositions

(i) Gene Expression Products and Splice Variants of the Present Invention

Molecular profiling may also include but is not limited to assays of the present disclosure including assays for one or more of the following: proteins, protein expression products, DNA, DNA polymorphisms, RNA, RNA expression products, RNA expression product levels, or RNA expression product splice variants of the genes provided in FIG. 2-6, 9-13, 16 or 17. In some cases, the methods of the present invention provide for improved cancer diagnostics by molecular profiling of about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 120, 140, 160, 180, 200, 240, 280, 300, 350, 400, 450, 500, 600, 700, 800, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 5000 or more DNA polymorphisms, expression product markers, and/or alternative splice variant markers.

In one embodiment, molecular profiling involves microarray hybridization that is performed to determine gene expression product levels for one or more genes selected from: FIG. 2-6, 9-13, 16 or 17. In some cases, gene expression product levels of one or more genes from one group are compared to gene expression product levels of one or more genes in another group or groups. As an example only and without limitation, the expression level of gene TPO may be compared to the expression level of gene GAPDH. In another embodiment, gene expression levels are determined for one or more genes involved in one or more of the following metabolic or signaling pathways: thyroid hormone production and/or release, protein kinase signaling pathways, lipid kinase signaling pathways, and cyclins. In some cases, the methods of the present invention provide for analysis of gene expression product levels and/or alternative exon usage of at least one gene of 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, or 15 or more different metabolic or signaling pathways.

(ii) Compositions of the Present Invention

Compositions of the present disclosure are also provided, which compositions comprise one or more of the following: nucleotides (e.g. DNA or RNA) corresponding to the genes or a portion of the genes provided in FIG. 2-6, 9-13, 16 or 17, and nucleotides (e.g. DNA or RNA) corresponding to the complement of the genes or a portion of the complement of the genes provided in FIG. 2-6, 9-13, 16 or 17. The nucleotides of the present invention can be at least about 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 100, 150, 200, 250, 300, 350, or about 400 or 500 nucleotides in length. In some embodiments of the present invention, the nucleotides can be natural or man-made derivatives of ribonucleic acid or deoxyribonucleic acid including but not limited to peptide nucleic acids, pyranosyl RNA, nucleosides, methylated nucleic acid, pegylated nucleic acid, cyclic nucleotides, and chemically modified nucleotides. In some of the compositions of the present invention, nucleotides of the present invention have been chemically modified to include a detectable label. In some embodiments of the present invention, the biological sample has been chemically modified to include a label.

A further composition of the present disclosure comprises oligonucleotides for detecting (i.e. measuring) the expression products of the genes provided in FIG. 2-6, 9-13, 16 or 17 and their complement. A further composition of the present disclosure comprises oligonucleotides for detecting (i.e. measuring) the expression products of polymorphic alleles of the genes provided in FIG. 2-6, 9-13, 16 or 17 and their complement. Such polymorphic alleles include but are not limited to splice site variants, single nucleotide polymorphisms, variable number repeat polymorphisms, insertions, deletions, and homologues. In some cases, the variant alleles are between about 99.9% and about 70% identical to the genes listed in FIG. 6, including about 99.75%, 99.5%, 99.25%, 99%, 97.5%, 95%, 92.5%, 90%, 85%, 80%, 75%, and about 70% identical. In some cases, the variant alleles differ by between about 1 nucleotide and about 500 nucleotides from the genes provided in FIG. 2-6, 9-13, 16 or 17, including about 1, 2, 3, 5, 7, 10, 15, 20, 25, 30, 35, 50, 75, 100, 150, 200, 250, 300, and about 400 nucleotides.

In some embodiments, the composition of the present invention may be specifically selected from the top differentially expressed gene products between benign and malignant samples, or the top differentially spliced gene products between benign and malignant samples, or the top differentially expressed gene products between normal and benign or malignant samples, or the top differentially spliced gene products between normal and benign or malignant samples. In some cases, the top differentially expressed gene products may be selected from FIG. 2 and/or FIG. 4. In some cases, the top differentially spliced gene products may be selected from FIG. 3 and/or FIG. 5.

X. Business Methods

As described herein, the term customer or potential customer refers to individuals or entities that may utilize methods or services of the molecular profiling business. Potential customers for the molecular profiling methods and services described herein include, for example, patients, subjects, physicians, cytological labs, health care providers, researchers, insurance companies, government entities such as Medicaid, employers, or any other entity interested in achieving a more economical or effective system for diagnosing, monitoring and treating cancer.

Such parties can utilize the molecular profiling results, for example, to selectively indicate expensive drugs or therapeutic interventions to patients likely to benefit the most from said drugs or interventions, or to identify individuals who would not benefit or may be harmed by the unnecessary use of drugs or other therapeutic interventions.

(i) Methods of Marketing

The services of the molecular profiling business of the present invention may be marketed to individuals concerned about their health, physicians or other medical professionals, for example as a method of enhancing diagnosis and care; cytological labs, for example as a service for providing enhanced diagnosis to a client; health care providers, insurance companies, and government entities, for example as a method for reducing costs by eliminating unwarranted therapeutic interventions. Methods of marketing to potential clients further include marketing of database access for researchers and physicians seeking to find new correlations between gene expression products and diseases or conditions.

The methods of marketing may include the use of print, radio, television, or internet based advertisement to potential customers. Potential customers may be marketed to through specific media, for example, endocrinologists may be marketed to by placing advertisements in trade magazines and medical journals including but not limited to The Journal of the American Medical Association, Physicians Practice, American Medical News, Consultant, Medical Economics, Physician's Money Digest, American Family Physician, Monthly Prescribing Reference, Physicians' Travel and Meeting Guide, Patient Care, Cortlandt Forum, Internal Medicine News, Hospital Physician, Family Practice Management, Internal Medicine World Report, Women's Health in Primary Care, Family Practice News, Physician's Weekly, Health Monitor, The Endocrinologist, Journal of Endocrinology, The Open Endocrinology Journal, and The Journal of Molecular Endocrinology. Marketing may also take the form of collaborating with a medical professional to perform experiments using the methods and services of the present invention and in some cases publish the results or seek funding for further research. In some cases, methods of marketing may include the use of physician or medical professional databases such as, for example, the American Medical Association (AMA) database, to determine contact information.

In one embodiment, methods of marketing comprise collaborating with cytological testing laboratories to offer a molecular profiling service to customers whose samples cannot be unambiguously diagnosed using routine methods.

(ii) Business Methods Utilizing a Computer

The molecular profiling business may utilize one or more computers in the methods of the present invention such as a computer 800 as illustrated in FIG. 22. The computer 800 may be used for managing customer and sample information such as sample or customer tracking, database management, analyzing molecular profiling data, analyzing cytological data, storing data, billing, marketing, reporting results, or storing results. The computer may include a monitor 807 or other graphical interface for displaying data, results, billing information, marketing information (e.g. demographics), customer information, or sample information. The computer may also include means for data or information input 816, 815. The computer may include a processing unit 801 and fixed 803 or removable 811 media or a combination thereof. The computer may be accessed by a user in physical proximity to the computer, for example via a keyboard and/or mouse, or by a user 822 that does not necessarily have access to the physical computer through a communication medium 805 such as a modem, an internet connection, a telephone connection, or a wired or wireless communication signal carrier wave. In some cases, the computer may be connected to a server 809 or other communication device for relaying information from a user to the computer or from the computer to a user. In some cases, the user may store data or information obtained from the computer through a communication medium 805 on media, such as removable media 812. It is envisioned that data relating to the present invention can be transmitted over such networks or connections for reception and/or review by a party. The receiving party can be but is not limited to an individual, a health care provider or a health care manager. In one embodiment, a computer-readable medium includes a medium suitable for transmission of a result of an analysis of a biological sample, such as exosome bio-signatures. The medium can include a result regarding an exosome bio-signature of a subject, wherein such a result is derived using the methods described herein.

The molecular profiling business may enter sample information into a database for the purpose of one or more of the following: inventory tracking, assay result tracking, order tracking, customer management, customer service, billing, and sales. Sample information may include, but is not limited to: customer name, unique customer identification, customer associated medical professional, indicated assay or assays, assay results, adequacy status, indicated adequacy tests, medical history of the individual, preliminary diagnosis, suspected diagnosis, sample history, insurance provider, medical provider, third party testing center or any information suitable for storage in a database. Sample history may include but is not limited to: age of the sample, type of sample, method of acquisition, method of storage, or method of transport.

The database may be accessible by a customer, medical professional, insurance provider, third party, or any individual or entity to which the molecular profiling business grants access. Database access may take the form of electronic communication such as a computer or telephone. The database may be accessed through an intermediary such as a customer service representative, business representative, consultant, independent testing center, or medical professional. The availability or degree of database access or sample information, such as assay results, may change upon payment of a fee for products and services rendered or to be rendered. The degree of database access or sample information may be restricted to comply with generally accepted or legal requirements for patient or customer confidentiality. The molecular profiling company may bill the individual, insurance provider, medical provider, or government entity for one or more of the following: sample receipt, sample storage, sample preparation, cytological testing, molecular profiling, input and update of sample information into the database, or database access.

(iii) Business Flow

FIG. 18a is a flow chart illustrating one way in which samples might be processed by the molecular profiling business. Samples of thyroid cells, for example, may be obtained by an endocrinologist, perhaps via fine needle aspiration 100. Samples are subjected to routine cytological staining procedures 125. Said routine cytological staining provides four different possible preliminary diagnoses: non-diagnostic 105, benign 110, ambiguous or suspicious 115, or malignant 120. The molecular profiling business may then analyze gene expression product levels as described herein 130. Said analysis of gene expression product levels, molecular profiling, may lead to a definitive diagnosis of malignant 140 or benign 135. In some cases, only a subset of samples is analyzed by molecular profiling, such as those that provide ambiguous and non-diagnostic results during routine cytological examination. Alternative embodiments by which samples may be processed by the methods of the present invention are provided in FIGS. 18b and 21.
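By way of a non-limiting sketch, the routing step described for FIG. 18a could be expressed as below. The enum, the set of forwarded results, and the returned strings are illustrative assumptions; only the four preliminary diagnoses and the option of profiling just the ambiguous and non-diagnostic subset are taken from the description.

```python
from enum import Enum


class Cytology(Enum):
    """The four preliminary diagnoses from routine cytological staining."""
    NON_DIAGNOSTIC = "non-diagnostic"
    BENIGN = "benign"
    AMBIGUOUS = "ambiguous or suspicious"
    MALIGNANT = "malignant"


# Assumption: in this embodiment only these results go on to profiling.
FORWARDED = {Cytology.NON_DIAGNOSTIC, Cytology.AMBIGUOUS}


def next_step(result):
    """Return the (hypothetical) next processing step for a sample."""
    if result in FORWARDED:
        return "molecular profiling"
    return "report cytology result"
```

In an embodiment where every sample is profiled, the FORWARDED set would simply contain all four members.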

In some cases, the molecular profiling results confirm the routine cytological test results. In other cases, the molecular profiling results differ. In such cases, samples may be further tested, data may be reexamined, or the molecular profiling results or cytological assay results may be taken as the correct diagnosis. Benign diagnoses may also include diseases or conditions that, while not malignant cancer, may indicate further monitoring or treatment. Similarly, malignant diagnoses may further include diagnosis of the specific type of cancer or a specific metabolic or signaling pathway involved in the disease or condition. Said diagnoses may indicate a treatment or therapeutic intervention such as radioactive iodine ablation, surgery, or thyroidectomy; or further monitoring.

XI. Kits

The molecular profiling business may provide a kit for obtaining a suitable sample. Said kit 203 as depicted in FIG. 19 may comprise a container 202, a means for obtaining a sample 200, reagents for storing the sample 205, and instructions for use of said kit. In another embodiment, the kit further comprises reagents and materials for performing the molecular profiling analysis. In some cases, the reagents and materials include a computer program for analyzing the data generated by the molecular profiling methods. In still other cases, the kit contains a means by which the biological sample is stored and transported to a testing facility such as the molecular profiling business or a third party testing center.

The molecular profiling business may also provide a kit for performing molecular profiling. Said kit may comprise a means for extracting protein or nucleic acids, including all necessary buffers and reagents; and a means for analyzing levels of protein or nucleic acids, including controls and reagents. The kit may further comprise software or a license to obtain and use software for analysis of the data provided using the methods and compositions of the present invention.

EXAMPLES

Example 1: Gene Expression Product Analysis of Thyroid Samples

Seventy-five thyroid samples were examined for gene expression analysis using the Affymetrix Human Exon 1.0 ST array according to manufacturer's instructions to identify genes that showed significantly differential expression and/or alternative splicing between malignant, benign, and normal samples. Three groups were compared and classified according to pathological surgical diagnosis of the tissue: benign (n=29), malignant (n=37), and normal (n=9). The samples were prepared from surgical thyroid tissue, snap frozen, and then the RNA was prepared by standard methods. The names and pathological classification of the 75 samples are depicted in FIG. 1.

Microarray analysis was run with XRAY version 2.69 (Biotique Systems Inc.). Input files were normalized with full quantile normalization (Irizarry et al. Biostatistics 2003 Apr; 4(2): 249-64). For each input array and each probe expression value, the array's i-th percentile probe value was replaced with the average of the i-th percentile values across all arrays. A total of 6,553,590 probes were manipulated in the analysis. Probes with a GC count of less than 6 or greater than 17 were excluded from the analysis. The expression score for each probe-set was derived via application of median-polish (exon RMA) to the probe scores across all input hybridizations, and probe-sets with fewer than 3 probes (that pass all of the tests defined above) were excluded from further analysis. Only 'Core' probe-sets, corresponding to probe-sets matching entries in the high quality databases RefSeq and Ensembl, were analyzed. Non-expressed probes and invariant probes were also removed from analysis for both gene level and probe set level analyses. One-way ANOVA was used to examine gene expression at the probe set level between the malignant and benign groups.
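The normalization and GC-count filtering steps above can be illustrated with a minimal sketch. This is not the XRAY implementation; it assumes a small in-memory matrix with probes as rows and arrays as columns, handles ties naively by sort order, and uses hypothetical function names.

```python
import numpy as np


def quantile_normalize(values):
    """Full quantile normalization of a probes x arrays matrix.

    Within each array (column), the value at rank i is replaced by the
    mean of the rank-i values across all arrays.
    """
    x = np.asarray(values, dtype=float)
    order = np.argsort(x, axis=0, kind="stable")    # per-array rank order
    sorted_x = np.take_along_axis(x, order, axis=0)
    rank_means = sorted_x.mean(axis=1)              # mean at each rank
    out = np.empty_like(x)
    # Write the rank means back to each value's original position.
    np.put_along_axis(out, order, rank_means[:, None], axis=0)
    return out


def gc_count_in_range(probe_sequence, low=6, high=17):
    """Keep probes whose G+C count lies in [low, high] (bounds from the text)."""
    seq = probe_sequence.upper()
    gc = seq.count("G") + seq.count("C")
    return low <= gc <= high
```

After normalization every array shares the same empirical distribution of values, which is what allows probe scores to be compared across hybridizations before summarization.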

The top 100 differentially expressed genes by gene level analysis (i.e. those genes which showed the greatest differential expression) were obtained from the dataset in which benign, malignant, and normal thyroid samples were compared. Markers were selected based on statistical significance after Benjamini and Hochberg correction for false discovery rate (FDR). An FDR filter value of p<0.01 was used, followed by ranking with absolute fold change (>1.9) calculated per marker as the highest differential gene expression value in any group (benign, malignant, or normal) divided by the lowest differential expression in the remaining two groups. The results of this analysis are shown in FIG. 2. This table lists three sets of calculated fold changes for any given marker to allow comparison between the groups. The fold changes malignant/benign, malignant/normal, and benign/normal were all calculated by dividing the expression of one group by the expression of another.
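The selection procedure above (FDR filter followed by fold-change ranking) may be sketched as follows. The sketch assumes per-marker raw p-values and linear-scale group mean expression values are already available; the function names and the tuple layout are hypothetical, while the 0.01 and 1.9 thresholds and the top-100 cut are those stated in the text.

```python
import numpy as np


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted


def max_fold_change(group_means):
    """Highest group expression divided by the lowest of the other groups."""
    values = list(group_means.values())
    return max(values) / min(values)


def rank_markers(markers, fdr_cutoff=0.01, min_fold_change=1.9, top_n=100):
    """markers: list of (name, raw_p, {group: mean expression}) tuples."""
    adjusted = benjamini_hochberg([m[1] for m in markers])
    kept = []
    for (name, _, means), q in zip(markers, adjusted):
        fc = max_fold_change(means)
        if q < fdr_cutoff and fc > min_fold_change:
            kept.append((name, fc))
    kept.sort(key=lambda item: item[1], reverse=True)  # largest change first
    return kept[:top_n]
```

If expression scores are kept on a log2 scale, as is common after RMA-style summarization, the ratio in max_fold_change would instead be computed from a difference of logs; the division shown here follows the wording of the text.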

The top 100 alternatively spliced genes were obtained from the dataset in which benign, malignant, and normal thyroid samples were compared. Markers were selected based on statistical significance after Benjamini and Hochberg correction for false discovery rate (FDR). An FDR filter value of p<0.01 was used, and markers were ranked starting with lowest p-value. The threshold for listing a numerical value with the software used was p<1.0E-301; any numbers having a smaller p-value were automatically assigned a value of 0.00E+00. The results of this analysis are shown in FIG. 3. All the markers depicted are highly significant for alternative exon splicing.

The top 100 differentially expressed genes in the thyroid samples from FIG. 1 by probe-set level analysis were obtained from the dataset in which benign and malignant samples were analyzed. Markers were selected based on significance after Benjamini and Hochberg correction for false discovery rate (FDR). An FDR filter value of p<0.01 was used, followed by ranking with absolute fold-change (>2.0) calculated per marker as Malignant expression divided by Benign expression. The results of this analysis are shown in FIG. 4.

The top 100 statistically significant diagnostic markers determined by gene level analysis of the thyroid samples shown in FIG. 1 were also compiled. Data from the comparison between benign, malignant, and normal and from comparison between benign and malignant datasets were used. Markers were selected based on significance after Benjamini and Hochberg correction for false discovery rate (FDR). An FDR filter value of p<0.01 was used, followed by ranking with absolute fold-change (>1.6) calculated per marker as the highest differential expression value in any group (benign, malignant or normal) divided by the lowest differential expression in the remaining two groups. The fold-changes for Malignant/Benign, Malignant/Normal, and Benign/Normal were all calculated in similar fashion by dividing the expression of one group by the expression of another. The results of this analysis are shown in FIG. 5.

The full list of 4918 genes identified as statistically significantly differentially expressed, differentially spliced or both between benign and malignant, benign and normal, or malignant and normal samples at either the probe-set or gene level was also compiled. Markers were selected based on statistical significance after Benjamini and Hochberg correction for false discovery rate (FDR), and an FDR filter value of p<0.01 was used. The results are depicted in FIG. 6.

Example 2: Gene Expression Product Analysis of Thyroid Tissue Samples

A total of 205 thyroid tissue samples (FIG. 7) are examined with an Affymetrix HumanExon10ST array chip to identify genes that differ significantly in RNA expression levels between benign and malignant samples. Samples are classified according to post-surgical thyroid pathology: samples exhibiting follicular adenoma (FA), lymphocytic thyroiditis (LCT), or nodular hyperplasia (NHP) are classified as benign; samples exhibiting Hurthle cell carcinoma (HC), follicular carcinoma (FC), follicular variant of papillary thyroid carcinoma (FVPTC), papillary thyroid carcinoma (PTC), medullary thyroid carcinoma (MTC), or anaplastic carcinoma (ATC) are classified as malignant.

Affymetrix software is used to extract, normalize, and summarize intensity data from roughly 6.5 million probes. Approximately 280,000 core probe sets are subsequently used in feature selection and classification. The models used are LIMMA for feature selection and random forest and support vector machine (SVM) for classification. Iterative rounds of training, classification, and cross validation are performed using random subsets of data. Top features are identified in two separate analyses (malignant vs. benign and MTC vs. rest) using the classification engine described above.

Markers are selected based on significance after Benjamini and Hochberg correction for false discovery rate (FDR). An FDR filter of p<0.05 is used.

A malignant vs. benign comparison of thyroid tissue samples finds 413 markers that are diagnostic for thyroid diseases or conditions. The top 100 markers are listed in FIG. 9.

An MTC vs. the rest (i.e. non-MTC) comparison of thyroid tissue samples finds 671 markers that are diagnostic for thyroid diseases or conditions. The top 100 markers are listed in FIG. 10.

Example 3: Meta-Analysis of Gene Expression Product Data from Thyroid Samples

Surgical thyroid tissue samples (FIG. 7) and thyroid samples obtained via fine needle aspiration (FIG. 8) are identified as benign or malignant by pathological examination and then examined by hybridization to an Affymetrix HumanExon10ST array. A meta-analysis approach is utilized which allows the identification of genes with repeatable features in each classification. Affymetrix software is used to extract, normalize, and summarize intensity data from approximately 6.5 million probes. Roughly 280,000 probe sets are used for feature selection and classification. LIMMA is used for feature selection. Classification is performed with random forest and SVM methods. Markers that repeatedly appear in multiple iterative rounds of training, classification, and cross validation of the surgical and fine needle aspirate samples are identified and ranked. A joint set of core features is created using the top ranked features that appear for both the surgical and fine needle aspirate data. Markers with a non-zero repeatability score are selected as significant. A total of 102 markers are found to be significant and are listed in FIG. 11.
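
One possible way to compute a repeatability score and a joint core feature set of the kind described above is sketched below. The exact scoring formula is not specified herein, so the sketch assumes the score is the fraction of cross-validation rounds in which a marker is selected; the function names, marker identifiers, and `top_n` parameter are hypothetical.

```python
from collections import Counter


def repeatability_scores(rounds):
    """Fraction of cross-validation rounds in which each marker appears.

    `rounds` is a list of collections of marker IDs, one per round.
    Markers that never appear are absent (their score is zero).
    """
    counts = Counter()
    for selected in rounds:
        counts.update(set(selected))  # count each marker once per round
    return {marker: n / len(rounds) for marker, n in counts.items()}


def top_ranked(scores, top_n):
    """Top markers by score; ties are broken alphabetically (assumption)."""
    ranked = sorted(scores, key=lambda m: (-scores[m], m))
    return ranked[:top_n]


def joint_core_features(surgical_rounds, fna_rounds, top_n):
    """Markers that rank in the top N for both sample types."""
    surgical_top = top_ranked(repeatability_scores(surgical_rounds), top_n)
    fna_top = set(top_ranked(repeatability_scores(fna_rounds), top_n))
    return [m for m in surgical_top if m in fna_top]


# Hypothetical selections from three surgical and two FNA rounds.
surgical_rounds = [["g1", "g2"], ["g1", "g3"], ["g1", "g2"]]
fna_rounds = [["g1", "g4"], ["g2", "g1"]]
core = joint_core_features(surgical_rounds, fna_rounds, top_n=2)
print(core)
```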

Example 4: Bayesian Analysis of Gene Expression Product Data from Thyroid Samples

Two groups of well-characterized samples are compared in order to identify genes that distinguish benign from malignant nodules in the human thyroid. Samples are derived from surgical thyroid tissue (tissue; n=205, FIG. 7) or from fine needle aspirates (FNA; n=74, FIG. 8) and are examined by hybridization to the HumanExon10ST microarray. Pathology labels for each distinct thyroid subtype are coded as either benign (B) or malignant (M). A total of 499 markers that show distinct differential expression between benign and malignant samples are identified.

Affymetrix software is used to extract, normalize, and summarize intensity data from approximately 6.5 million probes. Roughly 280,000 core probe sets are subsequently used in feature selection and classification of ~22,000 genes. The models used are LIMMA (for feature selection) and SVM (for classification).

Next, previously published molecular profile studies are examined in order to derive the type I and type II error rates of assigning a gene into the “benign” or “malignant” category. The error rates are calculated based on the sample size reported in each particular published study with an estimated fold-change value of two. Lastly, these prior probabilities are combined with the output of the Tissue dataset to estimate the posterior probability of differential gene expression, and then combined with the FNA dataset to formulate the final posterior probabilities of differential expression (Smyth 2004). These posterior probabilities are used to rank the genes and those that exceed a posterior probability threshold of 0.9 are selected. A total of 499 markers are identified as significant and the top 100 are listed in FIG. 12.
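
The sequential combination of a prior with two datasets can be illustrated in odds form. The following is only a sketch under stated assumptions: it treats each dataset as contributing an independent likelihood ratio (Bayes factor) for differential expression, which is a simplification and not necessarily the computation performed in this Example; the prior probabilities, Bayes factors, and gene labels are hypothetical.

```python
def update_posterior(prior_prob, bayes_factor):
    """Bayes update in odds form: posterior odds = prior odds * BF."""
    prior_odds = prior_prob / (1.0 - prior_prob)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


def final_posterior(prior_prob, tissue_bf, fna_bf):
    """Apply the tissue evidence first, then the FNA evidence."""
    after_tissue = update_posterior(prior_prob, tissue_bf)
    return update_posterior(after_tissue, fna_bf)


# Hypothetical genes: (prior probability, tissue BF, FNA BF).
genes = {
    "gene_a": (0.5, 3.0, 4.0),
    "gene_b": (0.1, 3.0, 4.0),
}
posteriors = {name: final_posterior(*args) for name, args in genes.items()}

# Rank by posterior probability and keep genes above the 0.9 threshold.
ranked = sorted(posteriors, key=posteriors.get, reverse=True)
selected = [name for name in ranked if posteriors[name] > 0.9]
print(posteriors, selected)
```

With identical evidence from both datasets, only the gene with the stronger literature-derived prior crosses the 0.9 threshold, which shows how the prior influences the final ranking.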

Example 5: Subtype Analysis of Gene Expression Product Data from Thyroid Samples

Well-characterized samples are examined in order to distinguish benign nodules from those with distinct pathology in the human thyroid. 205 hybridizations to the HumanExon10ST microarray are examined. Pathology labels for each distinct thyroid subtype are used to systematically compare one group versus another. A total of 250 mRNA markers that separate thyroid into a wide range of pathology subtypes are identified.

A total of 205 thyroid tissue samples are examined with the Affymetrix HumanExon10ST array chip to identify genes that differ significantly in mRNA expression between distinct thyroid pathology subtypes (FIG. 7). Samples classified according to post-surgical thyroid pathology as follicular adenoma (FA, n=22), lymphocytic thyroiditis (LCT, n=39), or nodular hyperplasia (NHP, n=24) are collectively classified as benign (n=85). In contrast, samples classified as Hurthle cell carcinoma (HC, n=27), follicular carcinoma (FC, n=19), follicular variant of papillary thyroid carcinoma (FVPTC, n=21), papillary thyroid carcinoma (PTC, n=26), medullary thyroid carcinoma (MTC, n=22), or anaplastic carcinoma (ATC, n=5) are collectively classified as malignant (n=120).

Affymetrix software is used to extract, normalize, and summarize intensity data from roughly 6.5 million probes. Approximately 280,000 core probe sets are subsequently used in feature selection and classification. A given benign subtype set (e.g., NHP) is compared against a pool of all malignant subtypes (e.g., NHP vs. M); next, the benign subset is compared against each set of malignant subtypes (NHP vs. FC, NHP vs. PTC, etc.). The models used in the classification engine are LIMMA (for feature selection) and random forest and SVM (for classification). Iterative rounds of training, classification, and cross-validation are performed using random subsets of data. A joint core-set of genes that separate distinct thyroid subtypes is created.
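
The comparison scheme just described (each benign subtype against the pooled malignant samples, then against each malignant subtype in turn) can be enumerated as in the sketch below. The subtype sample counts are those stated in this Example; the function name and tuple layout are assumptions introduced for illustration.

```python
# Sample counts per subtype as stated in this Example.
BENIGN = {"FA": 22, "LCT": 39, "NHP": 24}
MALIGNANT = {"HC": 27, "FC": 19, "FVPTC": 21, "PTC": 26, "MTC": 22, "ATC": 5}


def subtype_comparisons(benign, malignant):
    """List (benign subtype, comparator, n_benign, n_comparator) tuples."""
    pooled_malignant = sum(malignant.values())
    comparisons = []
    for b_name, b_count in benign.items():
        # First: the benign subtype versus all malignant samples pooled ("M").
        comparisons.append((b_name, "M", b_count, pooled_malignant))
        # Then: the benign subtype versus each malignant subtype.
        for m_name, m_count in malignant.items():
            comparisons.append((b_name, m_name, b_count, m_count))
    return comparisons


comparisons = subtype_comparisons(BENIGN, MALIGNANT)
for b_name, comparator, n_b, n_c in comparisons[:3]:
    print(f"{b_name} vs. {comparator}: {n_b} vs. {n_c} samples")
```

Enumerating the comparisons with their group sizes also makes it apparent which pairwise contrasts rest on few samples (for example, any comparison against ATC, n=5).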

Markers are selected based on the set of genes that optimizes the classifier after pair-wise classification. A total of 251 markers mapping to 250 distinct genes allow the separation of 1-3 distinct thyroid subtypes (FIG. 13).

Example 6: Differentially Expressed miRNAs Identified Via the Agilent v2 microRNA Array

Thyroid samples are hybridized to the Agilent Human v2 microRNA (miRNA) array. This array contains probes to 723 human and 76 viral miRNAs, and these are targeted using ~15,000 probesets. A comparison between benign (B) and malignant (M) thyroid samples is performed to identify significant differentially expressed miRNAs. All samples are derived from clinical fine needle aspirates (n=89, FIG. 14).

Array intensity data is extracted, normalized, and summarized, followed by modeling using a classification engine. Briefly, the models used are LIMMA (for feature selection) and random forest and support vector machine (SVM) (for classification). Iterative rounds of training, classification, and cross-validation are performed using random subsets of data. Although several miRNAs are differentially expressed in malignant as compared to benign samples (FIG. 16), no stand-alone classifiers are identified with this approach.

Example 7: Differentially Expressed miRNAs that are Diagnostic for Thyroid Diseases

Thyroid nodule samples are hybridized to the Illumina Human v2 miRNA array. This array contains probes to 1146 human miRNAs. A comparison between benign and malignant thyroid samples is performed to identify significant differentially expressed miRNAs. All samples are derived from clinical FNAs (n=24, FIG. 15).

Array intensity data is extracted, normalized, and summarized, followed by modeling using a classification engine. Briefly, the models used are LIMMA (for feature selection) and random forest and support vector machine (SVM) (for classification). An additional “hot probes” method is added to the classification engine, which in part incorporates a meta-analysis approach to the algorithm. Iterative rounds of training, classification, and cross-validation are performed using random subsets of data. The “hot probes” method identifies probes that appear in every loop of cross-validation, thereby creating a set of robust, repeatable features. Markers are selected based on the p-value (P) of a comparison between malignant and benign samples. A total of 145 miRNAs are identified whose differential expression is diagnostic for benign or malignant thyroid conditions (FIG. 17).

Example 8: An Exemplary Device for Molecular Profiling

The molecular profiling business of the present invention compiles the list of 4918 genes of FIG. 6 that are differentially expressed, differentially spliced or both between benign and malignant, benign and normal, or malignant and normal samples at either the probe-set or the gene level. A subset of the 4918 genes is chosen for use in the diagnosis of biological samples by the molecular profiling business. Compositions of short (i.e. 12-25 nucleotide long) oligonucleotides complementary to the subset of the 4918 genes chosen for use by the molecular profiling business are synthesized by standard methods known in the art and immobilized on a solid support such as nitrocellulose, glass, a polymer, or a chip at known positions on the solid support.

Example 9: Molecular Profiling of a Biological Sample

A biological sample is obtained by fine needle aspiration and stored in two aliquots, one for molecular profiling and one for cytological analysis. The aliquot of biological sample for molecular profiling is added to lysis buffer and triturated, which results in lysing of the cells of the biological sample. Lysis buffer is prepared as follows: For 1 ml of cDNA lysis buffer, the following are mixed together on ice: 0.2 ml of Moloney murine leukemia virus (MMLV) reverse transcriptase, 5× (Gibco-BRL), 0.76 ml of H₂O (RNAse, DNAse free, Specialty Media), 5 μl of Nonidet P40 (USB), 10 μl of PrimeRNase inhibitor (3′5′ Incorporated), 10 μl of RNAguard (Pharmacia), and 20 μl of freshly made, 1/24 dilution of stock primer mix. The stock primer mix, kept aliquoted at −20° C., includes 10 μl each of 100 mM dATP, dCTP, dGTP, dTTP solutions (12.5 mM final) (Boehringer); 10 μl of 50 OD/ml pd(T) 19-24 (Pharmacia); and 30 μl H₂O.
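
The stated volumes and concentrations can be cross-checked with simple arithmetic, as in the sketch below. It reads the water volume of the stock primer mix as 30 μl, which is the value consistent with the stated 12.5 mM final concentration of each dNTP; the dictionary and function names are introduced only for this illustration.

```python
# Stock primer mix components in microliters (water read as 30 ul).
STOCK_PRIMER_MIX_UL = {
    "dATP (100 mM)": 10,
    "dCTP (100 mM)": 10,
    "dGTP (100 mM)": 10,
    "dTTP (100 mM)": 10,
    "pd(T)19-24 (50 OD/ml)": 10,
    "H2O": 30,
}

# Lysis buffer components in microliters for a nominal 1 ml preparation.
LYSIS_BUFFER_UL = {
    "MMLV reverse transcriptase, 5x": 200,
    "H2O": 760,
    "Nonidet P40": 5,
    "PrimeRNase inhibitor": 10,
    "RNAguard": 10,
    "stock primer mix, 1/24 dilution": 20,
}


def final_concentration(stock_conc, added_volume, total_volume):
    """Dilution: C_final = C_stock * V_added / V_total."""
    return stock_conc * added_volume / total_volume


primer_mix_total_ul = sum(STOCK_PRIMER_MIX_UL.values())
dntp_final_mm = final_concentration(100, 10, primer_mix_total_ul)
lysis_total_ul = sum(LYSIS_BUFFER_UL.values())

print(primer_mix_total_ul, dntp_final_mm, lysis_total_ul)
```

The primer mix totals 80 μl, giving 12.5 mM for each dNTP, and the lysis buffer components sum to 1005 μl, i.e. approximately the nominal 1 ml.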

Cell RNA is then primed with an oligo dT primer. Reverse transcription with reverse transcriptase is then performed in limiting conditions of time and reagents to facilitate incomplete extension and to prepare short cDNA of between about 500 bp to about 1000 bp. The cDNA is then tailed at the 5′ end with multiple dATP using polyA (dATP) and terminal transferase.

The cDNA is then amplified with PCR reagents using a 60mer primer having 24(dT) at the 3′ end. PCR cycling is performed at 94° C. for 1 minute, then 42° C. for 2 minutes and then 72° C. for 6 minutes with 10 second extension times at each cycle. 10 cycles are performed. Then additional Taq polymerase is added and an additional 25 cycles are performed.

cDNA is extracted in phenol-chloroform, precipitated with ethanol and then half of the sample is frozen at −80° C. as a stock to avoid thawing and freezing the entire amount of cDNA while analyzing it.

5 μg of PCR product is combined with 15.5 μl EF sln (Tris in Qiagen kit PCR purification), 4 μl of 10× One-Phor-All buffer from Promega, and 0.5 units of DNase I. The total volume is then held at 37° C. for 14 minutes, then held at 99° C. for 15 minutes and then put on ice for 5 minutes to fragment the PCR product into segments about 50 bp to about 100 bp in length. The fragments are then end-labeled by combining the total volume with 1 μl of Biotin-N6-ddATP (“NEN”) and 1.5 μl of TdT (terminal transferase) (15 units/μl). The total volume is then held at 37° C. for 1 hour, then held at 99° C. for 15 minutes and then held on ice for 5 minutes.

The labeled and fragmented cDNA is hybridized with the probeset of the present invention in 200 microliters of hybridization solution containing 5-10 microgram labeled target in 1×MES buffer (0.1 M MES, 1.0 M NaCl, 0.01% Triton X-100, pH 6.7) and 0.1 mg/ml herring sperm DNA. The arrays used are Affymetrix Human Exon 10ST arrays. The arrays are placed on a rotisserie and rotated at 60 rpm for 16 hours at 45° C. Following hybridization, the arrays are washed with 6×SSPE-T (0.9 M NaCl, 60 mM NaH2PO4, 6 mM EDTA, 0.005% Triton X-100, pH 7.6) at 22° C. on a fluidics station (Affymetrix) for 10×2 cycles, and then washed with 0.1 MES at 45° C. for 30 min. The arrays are then stained with a streptavidin-phycoerythrin conjugate (Molecular Probes), followed by 6×SSPE-T wash on the fluidics station for 10×2 cycles again. To enhance the signals, the arrays are further stained with Anti-streptavidin antibody for 30 min followed by a 15 min staining with a streptavidin-phycoerythrin conjugate again. After 6×SSPE-T wash on the fluidics station for 10×2 cycles, the arrays are scanned at a resolution of 3 microns using a modified confocal scanner to determine raw fluorescence intensity values at each position in the array, corresponding to gene expression levels for the sequence at that array position.

The raw fluorescence intensity values are converted to gene expression product levels, normalized via the RMA method, filtered to remove data that may be considered suspect, and input to a pre-classifier algorithm which corrects the gene expression product levels for the cell-type composition of the biological sample. The corrected gene expression product levels are input to a trained algorithm for classifying the biological sample as benign, malignant, or normal. The trained algorithm provides a record of its output including a diagnosis and a confidence level.

Example 10: Molecular Profiling of Thyroid Nodule

An individual notices a lump on his thyroid. The individual consults his family physician. The family physician decides to obtain a sample from the lump and subject it to molecular profiling analysis. Said physician uses a kit from the molecular profiling business to obtain the sample via fine needle aspiration, perform an adequacy test, store the sample in a liquid based cytology solution, and send it to the molecular profiling business. The molecular profiling business divides the sample for cytological analysis of one part and for the remainder of the sample extracts mRNA from the sample, analyzes the quality and suitability of the mRNA sample extracted, and analyzes the expression levels and alternative exon usage of a subset of the genes listed in FIG. 5. In this case, the particular gene expression products profiled are determined by the sample type, by the preliminary diagnosis of the physician, and by the molecular profiling company.

The molecular profiling business analyzes the data and provides a resulting diagnosis to the individual's physician as illustrated in FIG. 20. The results provide 1) a list of gene expression products profiled, 2) the results of the profiling (e.g. the expression level normalized to an internal standard such as total mRNA or the expression of a well characterized gene product such as tubulin), 3) the gene product expression level expected for normal tissue of matching type, and 4) a diagnosis and recommended treatment for the individual based on the gene product expression levels. The molecular profiling business bills the individual's insurance provider for products and services rendered.

Example 11: Molecular Profiling as an Adjunct to Cytological Examination

An individual notices a suspicious lump on her thyroid. The individual consults her primary care physician who examines the individual and refers her to an endocrinologist. The endocrinologist obtains a sample via fine needle aspiration, and sends the sample to a cytological testing laboratory. The cytological testing laboratory performs routine cytological testing on a portion of the fine needle aspirate, the results of which are ambiguous (i.e. indeterminate). The cytological testing laboratory suggests to the endocrinologist that the remaining sample may be suitable for molecular profiling, and the endocrinologist agrees.

The remaining sample is analyzed using the methods and compositions herein. The results of the molecular profiling analysis suggest a high probability of early stage follicular cell carcinoma. The results further suggest that molecular profiling analysis combined with patient data, including patient age and lump or nodule size, indicates thyroidectomy followed by radioactive iodine ablation. The endocrinologist reviews the results and prescribes the recommended therapy.

The cytological testing laboratory bills the endocrinologist for routine cytological tests and for the molecular profiling. The endocrinologist remits payment to the cytological testing laboratory and bills the individual's insurance provider for all products and services rendered. The cytological testing laboratory passes on payment for molecular profiling to the molecular profiling business and withholds a small differential.

Example 12: Molecular Profiling Performed by a Third Party

An individual complains to her physician about a suspicious lump on her neck. The physician examines the individual, and prescribes a molecular profiling test and a follow up examination pending the results. The individual visits a clinical testing laboratory also known as a CLIA lab. The CLIA lab is licensed to perform molecular profiling of the current invention. The individual provides a sample at the CLIA lab via fine needle aspiration, and the sample is analyzed using the molecular profiling methods and compositions herein. The results of the molecular profiling are electronically communicated to the individual's physician, and the individual is contacted to schedule a follow up examination. The physician presents the results of the molecular profiling to the individual and prescribes a therapy.

Example 13: Overlapping Genes Using Different Analysis Methods

The results described in Example 2 were obtained by examining surgical thyroid nodule tissue samples and comparing gene expression in malignant versus benign (“malignant vs. benign” data set). This analysis identified 412 genes that are differentially expressed (FDR p<0.05). In a previous study described in Example 1, using i) a different cohort of samples and ii) a different analysis method, we describe 4918 genes that can distinguish between malignant and benign thyroid nodules (“4918”). The “malignant vs. benign” tissue discovery dataset shares 231/412 genes with the “4918” discovery dataset, while 181/412 genes have been newly discovered.

A similar comparison between medullary thyroid cancer (MTC) and the “Rest” of the thyroid subtypes using the tissue cohort pointed to 668 significant genes that are differentially expressed between these two groups (FIG. 10). When cross-checked against our previous “4918” gene list, we note that 305/668 genes had been previously described, while 363/668 genes have been newly discovered.

We next combined the surgical tissue dataset with a fine needle aspirate (FNA) dataset and once again compared malignant versus benign using i) a “hot probes” and ii) a “Bayes” approach. These analyses identified 102 and 498 significant genes, respectively (FIGS. 11 and 12).

Up until this point a total of 1343 significant genes were identified. However, a subsequent subset analysis aimed at identifying those genes that separate distinct pathology subtypes from one another was also performed. This analysis used the surgical tissue cohort and resulted in 250 significant genes (FIG. 13).

In sum, the five comparisons described here give rise to 1437 significant genes. Of these, 636/1437 genes are described for the first time as distinguishing malignant versus benign thyroid pathology. As of today, 568/636 have not yet been described in published scientific literature or patent applications as diagnostic markers of thyroid cancer.
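
The cross-checking of gene lists described in this Example amounts to set intersection, difference, and union. The sketch below illustrates the bookkeeping with placeholder gene names (GENE_A and so on are hypothetical and do not refer to markers in the figures) and confirms that the shared and newly discovered counts reported above add up to the stated totals.

```python
def split_known_and_new(candidate_genes, reference_genes):
    """Return (genes already in the reference list, newly found genes)."""
    candidates = set(candidate_genes)
    reference = set(reference_genes)
    return candidates & reference, candidates - reference


def union_size(*gene_lists):
    """Number of distinct genes across several lists (overlaps count once)."""
    return len(set().union(*gene_lists))


# Placeholder gene lists for illustration only.
reference_list = {"GENE_A", "GENE_B", "GENE_C"}
malignant_vs_benign = {"GENE_B", "GENE_C", "GENE_D"}
shared, new = split_known_and_new(malignant_vs_benign, reference_list)
print(sorted(shared), sorted(new))

# Counts reported in this Example: (shared, newly discovered, total).
REPORTED = {
    "malignant vs. benign": (231, 181, 412),
    "MTC vs. rest": (305, 363, 668),
}
consistent = all(s + n == total for s, n, total in REPORTED.values())
print(consistent)
```

Because the individual lists overlap, the number of distinct genes across comparisons (the union) is smaller than the sum of the list sizes, which is why the combined totals above are lower than a simple sum.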

Example 14: Clinical Thyroid FNA

Methods

Prospective clinical thyroid FNA samples were examined with the Affymetrix Human Exon 1.0ST microarray in order to identify genes that differ significantly in mRNA expression between benign and malignant samples.

Affymetrix software was used to extract, normalize, and summarize intensity data from roughly 6.5 million probes. Approximately 280,000 core probe sets were subsequently used in feature selection and classification. The models used were LIMMA (for feature selection) and random forest and SVM (for classification) (Smyth 2004; Diaz-Uriarte and Alvarez de Andres 2006). Iterative rounds of training, classification, and cross-validation were performed using random subsets of data. Top features were identified in three separate analyses using the classification engine described above.

While the annotation and mapping of genes to transcript cluster identifiers (TCID) is constantly evolving, the nucleotide sequences in the probesets that make up a TCID do not change. Furthermore, a number of significant TCIDs do not map to any known genes, yet these are equally important biomarkers in the classification of thyroid malignancy. Results are described using both the TCID and the genes currently mapped to each (Affymetrix annotation file: HuEx-1_0-st-v2.na29.hg18.transcript.csv).
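
Because one TCID can map to more than one gene symbol, and one gene can be covered by more than one TCID, the counts of unique TCIDs and of mapped genes generally differ. The sketch below illustrates this with five (TCID, gene symbol) pairs taken from Table 3; the function name and data layout are assumptions, and the annotation file itself is not parsed here.

```python
from collections import defaultdict

# (TCID, gene symbol) pairs as listed in Table 3.
ROWS = [
    ("2884845", "GABRB2"),
    ("3638204", "MFGE8"),
    ("3638204", "QTRT1"),   # same TCID as MFGE8
    ("2598261", "FN1"),
    ("2526806", "FN1"),     # second TCID for FN1
]


def index_annotations(rows):
    """Build TCID -> genes and gene -> TCIDs lookups from annotation rows."""
    tcid_to_genes = defaultdict(list)
    gene_to_tcids = defaultdict(list)
    for tcid, gene in rows:
        tcid_to_genes[tcid].append(gene)
        gene_to_tcids[gene].append(tcid)
    return dict(tcid_to_genes), dict(gene_to_tcids)


tcid_to_genes, gene_to_tcids = index_annotations(ROWS)
print(len(tcid_to_genes), "unique TCIDs;", len(gene_to_tcids), "genes")
print(tcid_to_genes["3638204"], gene_to_tcids["FN1"])
```

Keying results on the TCID keeps them stable when the gene annotation is revised, since only the lookup tables need to be rebuilt from a newer annotation release.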

Results

The study of differential gene expression in prospectively collected, clinical thyroid FNA required a number of statistical sub-analyses. These sub-analyses alone resulted in the discovery of genes that are valuable in the classification of thyroid nodules of unknown pathology. However, the joining of the datasets has resulted in the novel characterization of thyroid gene panels, which can correctly classify thyroid FNA with improved accuracy over current cytopathology and molecular profiling methods.

TABLE 3 Top Benign vs. Malignant Analysis. This analysis resulted in 175 unique TCIDs, currently mapping to 198 genes.

TCID     Gene Symbol    FDR LIMMA  Fold
         (Affy v.na29)  p-value    Change
2884845  GABRB2         2.85E−35   3.22
2400177  CAMK2N1        8.23E−30   2.50
3638204  MFGE8          2.16E−29   1.75
3638204  QTRT1          2.16E−29   1.75
2708855  C11orf72       4.11E−27   2.27
2708855  LIPH           4.11E−27   2.27
3415744  IGFBP6         5.44E−27   1.81
3136178  PLAG1          1.64E−26   1.76
2657808  CLDN16         3.63E−26   3.01
3451375  PRICKLE1       3.63E−26   1.78
2442008  RXRG           7.62E−26   2.17
3329343  MDK            3.60E−24   1.34
3666366  CDH3           3.60E−24   1.25
3757108  KRT19          1.06E−23   1.44
3040518  MACC1          1.14E−23   1.73
3988596  ZCCHC12        2.14E−23   2.22
3416895  METTL7B        2.90E−23   1.33
2721959  ROS1           6.26E−23   3.05
2721959  SLC34A2        6.26E−23   3.05
3125116  DLC1           9.12E−23   0.82
2828441  PDLIM4         9.51E−23   0.81
2783596  PDE5A          1.60E−22   1.93
3645555  TNFRSF12A      1.71E−22   1.25
3973891  CXorf27        1.75E−22   1.38
3973891  SYTL5          1.75E−22   1.38
2827645  SLC27A6        2.02E−22   2.28
3020343  MET            2.02E−22   2.25
3452478  AMIGO2         2.03E−22   1.17
2451931  GOLT1A         2.15E−22   0.84
3679959  EMP2           3.81E−22   1.51
3417249  ERBB3          1.11E−21   1.05
3087167  TUSC3          1.16E−21   1.90
2924492  HEY2           1.38E−21   1.38
2685304  PROS1          1.48E−21   2.15
3335894  CST6           1.50E−21   2.50
3393720  MPZL2          1.52E−21   1.86
3907234  SDC4           1.60E−21   1.64
4012178  CITED1         4.03E−21   2.42
2994981  PRR15          5.89E−21   0.94
2973232  C6orf174       6.09E−21   1.07
2973232  KIAA0408       6.09E−21   1.07
2809245  ITGA2          6.13E−21   1.84
3067478  NRCAM          9.01E−21   1.70
3420316  HMGA2          1.13E−20   0.94
4018327  TRPC5          1.14E−20   1.78
3416921  RDH5           1.24E−20   0.55
2333318  PTPRF          1.42E−20   0.78
3336486  C11orf80       1.71E−20   0.58
3336486  RCE1           1.71E−20   0.58
3044072  NOD1           3.06E−20   1.01
3417809  NAB2           3.40E−20   0.57
2710599  CLDN1          4.47E−20   2.53
3343452  FZD4           4.93E−20   1.49
3343452  PRSS23         4.93E−20   1.49
2720584  SLIT2          6.84E−20   1.45
3389976  SLC35F2        1.16E−19   0.94
3587495  SCG5           1.45E−19   1.60
3744463  MYH10          1.58E−19   1.40
3987607  CCDC121        1.87E−19   1.56
3987607  ZCCHC16        1.87E−19   1.56
3984945  ARMCX3         3.69E−19   1.11
2558612  TGFA           9.18E−19   0.89
3522398  AIDA           1.02E−18   1.33
3522398  DOCK9          1.02E−18   1.33
2781736  CFI            1.04E−18   1.91
3338192  CCND1          1.09E−18   1.25
3338192  FLJ42258       1.09E−18   1.25
2414958  TACSTD2        1.12E−18   0.91
2991860  ITGB8          1.51E−18   1.30
2805078  CDH6           1.64E−18   1.58
3976341  TIMP1          1.98E−18   1.68
2562435  EDNRB          1.98E−18   1.61
2562435  SFTPB          1.98E−18   1.61
3726154  ITGA3          2.04E−18   1.17
2381249  C1orf115       4.38E−18   0.92
2356818  BCL9           6.05E−18   0.63
3451814  MAFG           7.13E−18   1.92
3451814  NELL2          7.13E−18   1.92
3445908  EPS8           7.19E−18   1.60
2451870  ETNK2          8.68E−18   1.00
3201345  LOC554202      1.08E−17   1.05
3581221  AHNAK2         1.14E−17   1.28
2966193  C6orf168       1.23E−17   0.85
2876608  CXCL14         1.85E−17   1.76
3129065  CLU            1.85E−17   1.37
3222170  TNC            1.94E−17   1.24
2438458  CRABP2         2.16E−17   1.24
2600689  EPHA4          2.17E−17   1.51
3763390  TMEM100        2.61E−17   1.34
2902958  C4A            3.56E−17   1.36
2902958  C4B            3.56E−17   1.36
2952834  KCNK5          6.07E−17   0.51
2452478  LEMD1          9.66E−17   1.27
3751002  RAB34          1.14E−16   0.83
3489138  CYSLTR2        1.72E−16   1.61
2417362  DIRAS3         1.72E−16   1.15
2370123  XPR1           1.81E−16   0.89
2680046  ADAMTS9        1.83E−16   1.40
3494629  SCEL           2.04E−16   1.61
3040967  RAPGEF5        2.04E−16   0.92
3554452  KIAA0284       2.33E−16   0.59
4020655  ODZ1           2.44E−16   1.97
2400518  ECE1           3.31E−16   0.98
2598261  FN1            3.58E−16   2.41
3187686  GSN            4.03E−16   0.78
2742224  SPRY1          3.51E−15   1.18
3628832  DAPK2          4.59E−15   1.17
3408831  SSPN           4.69E−15   0.99
3925639  NRIP1          5.01E−15   1.02
3683377  GPRC5B         5.39E−15   1.10
2397025  DHRS3          5.83E−15   1.14
2816298  IQGAP2         6.56E−15   −1.04
3848039  C3             7.85E−15   1.62
3367673  MPPED2         7.93E−15   −1.71
2822215  PAM            8.70E−15   1.08
2567167  LONRF2         1.12E−14   1.40
2522094  SPATS2L        2.21E−14   0.96
3898355  FLRT3          2.70E−14   1.96
3717870  TMEM98         2.72E−14   1.51
3212008  FRMD3          3.50E−14   1.43
2597867  IKZF2          3.58E−14   0.91
3007960  CLDN4          6.44E−14   1.27
2468811  ASAP2          7.11E−14   0.89
3046197  ELMO1          8.04E−14   −1.10
3132616  ZMAT4          8.04E−14   −1.29
3181600  GALNT12        8.25E−14   0.74
3095313  C8orf4         8.38E−14   1.28
2525533  LOC648149      8.38E−14   1.01
2525533  MAP2           8.38E−14   1.01
3464860  DUSP6          9.39E−14   1.10
3464860  LOC100131490   9.39E−14   1.10
2751936  GALNT7         1.52E−13   0.93
2578790  LRP1B          1.65E−13   −1.33
2700365  TM4SF1         2.19E−13   1.60
2598828  IGFBP5         2.87E−13   1.67
3126191  PSD3           3.12E−13   1.34
3979101  FAAH2          3.88E−13   0.68
3577612  SERPINA1       3.99E−13   1.12
3577612  SERPINA2       3.99E−13   1.12
3622934  MYEF2          4.25E−13   0.92
3622934  SLC24A5        4.25E−13   0.92
2738664  SGMS2          4.47E−13   1.13
3692999  MT1G           4.65E−13   −2.43
2902844  C2             7.40E−13   1.36
2902844  CFB            7.40E−13   1.36
3662201  MT1F           8.84E−13   −1.87
3662201  MT1H           8.84E−13   −1.87
3662201  MT1P2          8.84E−13   −1.87
2617188  ITGA9          1.07E−12   1.05
3401704  CCND2          1.09E−12   0.86
2562529  ST3GAL5        1.34E−12   0.88
2371139  LAMC2          1.53E−12   0.99
2626802  PTPRG          1.83E−12   1.06
2834282  STK32A         2.53E−12   1.23
2526806  FN1            3.12E−12   1.84
3111561  MAPK6          3.66E−12   −2.04
3111561  PKHD1L1        3.66E−12   −2.04
3238962  KIAA1217       7.24E−12   1.21
3238962  PRINS          7.24E−12   1.21
3110608  TM7SF4         7.72E−12   1.92
2466554  TPO            1.14E−11   −1.78
3126368  PSD3           2.30E−11   1.39
3558418  STXBP6         3.35E−11   0.94
2980449  IPCEF1         3.42E−11   −1.05
3907190  SLPI           4.25E−11   1.61
2955932  GPR110         5.17E−11   1.29
2976360  PERP           7.31E−11   1.31
2686023  DCBLD2         8.03E−11   0.98
2915828  NT5E           9.40E−11   1.19
3219621  CTNNAL1        1.17E−10   1.01
3971451  PHEX           1.39E−10   1.53
3417583  RBMS2          1.39E−10   1.09
2424102  CNN3           1.58E−10   1.07
3369931  RAG2           2.12E−10   −1.41
2730746  SLC4A4         2.24E−10   −1.21
3010503  CD36           2.91E−10   −1.42
3446137  LMO3           3.09E−10   1.44
3933536  TFF3           3.09E−10   −1.10
4021777  IGSF1          3.11E−10   1.55
3467949  SLC5A8         4.08E−10   −1.34
3288518  C10orf72       4.26E−10   1.18
2336891  DIO1           4.31E−10   −1.73
2498274  C2orf40        4.39E−10   1.71
2740067  ANK2           5.52E−10   −0.90
2924330  TPD52L1        6.04E−10   1.09
2427469  SLC16A4        6.71E−10   1.37
2727587  KIT            1.23E−09   −1.24
3464417  MGAT4C         1.45E−09   1.26
2331558  BMP8A          3.61E−09   −1.55
2711205  ATP13A4        6.51E−09   1.15
3142381  FABP4          7.25E−09   −1.59
3743551  CLDN7          8.01E−09   1.13
3662150  MT1M           8.06E−09   −1.47
3662150  MT1P3          8.06E−09   −1.47
3166644  TMEM215        9.05E−09   1.51
3087659  SLC7A2         1.32E−08   1.28
3321055  TEAD1          1.37E−07   1.10
3059667  SEMA3D         1.43E−07   −1.83

TABLE 4 Top Subtype Analysis This analysis resulted in 599 unique TCIDs,currently mapping to 681 genes. Gene Symbol TCID (Affy vna29) Subtype 1Subtype 2 Subtype 3 Subtype 4 3153400 3153400 NHP_PTC 3749600 3749600NHP_PTC 3726691 ABCC3 FA_FVPTC 3368940 ABTB2 NHP_PTC 3279058 ACBD7NHP_PTC 2796553 ACSL1 NHP_PTC 3299504 ACTA2 NHP_PTC 3927480 ADAMTS5NHP_PTC 2680046 ADAMTS9 NHP_PTC FA_FVPTC NHP_FVPTC LCT_REST 3252170 ADKNHP_PTC 3039791 AGR2 NHP_PTC 3581221 AHNAK2 NHP_PTC 2991233 AHR NHP_PTC3522398 AIDA NHP_PTC NHP_FVPTC 3226138 AK1 NHP_PTC 3233049 AKR1C3NHP_FVPTC 4009849 ALAS2 NHP_PTC 3611625 ALDH1A3 NHP_PTC 3169331 ALDH1B1FA_FVPTC 3571727 ALDH6A1 FA_FVPTC 3452478 AMIGO2 NHP_PTC 4018454 AMOTNHP_PTC 2740067 ANK2 NHP_PTC NHP_FVPTC 3323748 ANO5 FA_FVPTC NHP_FVPTC3174816 ANXA1 NHP_PTC 2732844 ANXA3 NHP_PTC 2881747 ANXA6 NHP_FVPTC3046062 AOAH NHP_PTC 2455418 AP3S1 NHP_PTC 4002809 APOO FA_FVPTCNHP_FVPTC 3595594 AQP9 NHP_PTC 2734421 ARHGAP24 NHP_PTC 2632453 ARL13BNHP_PTC 2931391 ARL4A NHP_PTC 3984945 ARMCX3 NHP_PTC 4015838 ARMCX6NHP_PTC 3321150 ARNTL NHP_PTC 3768474 ARSG NHP_FVPTC 2468811 ASAP2NHP_PTC 2526759 ATIC NHP_PTC 2711225 ATP13A4 NHP_PTC 2711205 ATP13A4NHP_PTC 3105749 ATP6V0D2 NHP_FVPTC 3824596 B3GNT3 NHP_PTC 2356818 BCL9NHP_PTC 2608725 BHLHE40 NHP_PTC 3448088 BHLHE41 NHP_PTC 3772187 BIRC5LCT_REST 2331558 BMP8A NHP_PTC NHP_FVPTC 3926080 BTG3 NHP_PTC 3288518C10orf72 FA_FVPTC NHP_FVPTC 2708855 C11orf72 NHP_PTC FA_FVPTC NHP_FVPTC3327166 C11orf74 FA_FVPTC NHP_FVPTC 3336486 C11orf80 NHP_PTC 3473331C12orf49 NHP_PTC 3571727 C14orf45 FA_FVPTC 3649714 C16orf45 NHP_PTC3832280 C19orf33 NHP_PTC 2381249 C1orf115 NHP_PTC 2453065 C1orf116NHP_PTC 2902844 C2 NHP_PTC 3963676 C22orf9 NHP_FVPTC 2498274 C2orf40NHP_PTC FA_FVPTC NHP_FVPTC 3848039 C3 NHP_PTC 2902958 C4A NHP_PTCFA_FVPTC NHP_FVPTC 2902958 C4B NHP_PTC FA_FVPTC NHP_FVPTC 2766492C4orf34 NHP_PTC 2730303 C4orf7 LCT_REST 2855578 C5orf28 FA_FVPTC 2966193C6orf168 NHP_PTC 2973232 C6orf174 NHP_PTC FA_FVPTC 3060450 C7orf62NHP_PTC 3095313 
C8orf4 NHP_PTC 3086809 C8orf79 FA_FVPTC 3867264 CA11NHP_PTC 3392332 CADM1 NHP_PTC 2400177 CAMK2N1 NHP_PTC FA_FVPTC NHP_FVPTC3420713 CAND1 NHP_PTC 3020302 CAV1 NHP_PTC 3020273 CAV2 NHP_PTC 3987607CCDC121 NHP_PTC FA_FVPTC NHP_FVPTC 2582701 CCDC148 NHP_PTC 2688813CCDC80 NHP_PTC 3718204 CCL13 NHP_PTC 3204285 CCL19 LCT_REST 3338192CCND1 NHP_PTC FA_FVPTC NHP_FVPTC 3380065 CCND1 NHP_FVPTC 3401704 CCND2NHP_PTC NHP_FVPTC 3316344 CD151 NHP_PTC 2860178 CD180 LCT_REST 2636125CD200 NHP_PTC 3010503 CD36 NHP_PTC NHP_FVPTC 3834502 CD79A LCT_REST2671728 CDCP1 NHP_PTC 3694657 CDH11 NHP_PTC 3666366 CDH3 NHP_PTCFA_FVPTC 2805078 CDH6 NHP_PTC 3417146 CDK2 NHP_PTC 2773719 CDKL2 NHP_PTC2871896 CDO1 NHP_PTC 4024373 CDR1 NHP_PTC 2902844 CFB NHP_PTC 2373336CFH NHP_PTC 2373336 CFHR1 NHP_PTC 2781736 CFI NHP_PTC 3920003 CHAF1BNHP_PTC 3442054 CHD4 NHP_PTC 4012178 CITED1 NHP_PTC FA_FVPTC NHP_FVPTC3178583 CKS2 NHP_PTC 3862108 CLC NHP_PTC 2710599 CLDN1 NHP_PTC NHP_FVPTC3497195 CLDN10 NHP_PTC 2657808 CLDN16 NHP_PTC FA_FVPTC NHP_FVPTC 3007960CLDN4 NHP_PTC NHP_FVPTC 3743551 CLDN7 NHP_PTC 3443183 CLEC4E NHP_PTC3129065 CLU NHP_PTC FA_FVPTC 2424102 CNN3 NHP_PTC 3762198 COL1A1 NHP_PTC3761054 COPZ2 FA_FVPTC NHP_FVPTC 3106559 CP NHP_PTC 3105904 CPNE3FA_FVPTC 2377283 CR2 LCT_REST 3603295 CRABP1 NHP_PTC 2438458 CRABP2NHP_PTC FA_FVPTC 2406783 CSF3R NHP_PTC 3126504 CSGALNACT1 FA_FVPTC3335894 CST6 NHP_PTC FA_FVPTC NHP_FVPTC 3219621 CTNNAL1 NHP_PTC 2618940CTNNB1 NHP_PTC 3634811 CTSH NHP_PTC 3338552 CTTN NHP_PTC 2773434 CXCL1NHP_PTC 2732508 CXCL13 LCT_REST 2876608 CXCL14 NHP_PTC 3863640 CXCL17NHP_PTC 2773434 CXCL2 NHP_PTC 2773434 CXCL3 NHP_PTC 4024420 CXorf18NHP_PTC 3973891 CXorf27 NHP_PTC 3910429 CYP24A1 NHP_FVPTC 2528093CYP27A1 NHP_FVPTC 3489138 CYSLTR2 NHP_PTC FA_FVPTC NHP_FVPTC 3628832DAPK2 NHP_PTC FA_FVPTC NHP_FVPTC 2686023 DCBLD2 NHP_PTC 3683845 DCUN1D3NHP_PTC 2420832 DDAH1 NHP_PTC 3329649 DDB2 NHP_PTC 3754736 DDX52 NHP_PTC3487095 DGKH NHP_PTC 3074912 DGKI NHP_PTC 3558118 DHRS1 NHP_PTC 2397025DHRS3 NHP_PTC 
2336891 DIO1 NHP_PTC FA_FVPTC NHP_FVPTC 2417362 DIRAS3NHP_PTC NHP_FVPTC 3125116 DLC1 NHP_PTC FA_FVPTC 3522398 DOCK9 NHP_PTCNHP_FVPTC 3913483 DPH3B FA_FVPTC 2584018 DPP4 NHP_PTC 2880292 DPYSL3NHP_PTC 3783529 DSG2 NHP_PTC 2893794 DSP NHP_PTC 2958325 DST NHP_PTC3622176 DUOX1 NHP_FVPTC 3622176 DUOX2 NHP_FVPTC 3622239 DUOXA1 NHP_FVPTC3622239 DUOXA2 NHP_FVPTC 3129731 DUSP4 NHP_PTC 3263743 DUSP5 NHP_PTC3464860 DUSP6 NHP_PTC 3497195 DZIP1 NHP_PTC 2400518 ECE1 NHP_PTCFA_FVPTC NHP_FVPTC 2562435 EDNRB NHP_PTC 3002640 EGFR NHP_PTC 2484970EHBP1 NHP_PTC 3837431 EHD2 NHP_PTC 3326461 EHF NHP_PTC 3544387 EIF2B2FA_FVPTC 3427098 ELK3 NHP_PTC 3046197 ELMO1 NHP_PTC 3679959 EMP2 NHP_PTCFA_FVPTC NHP_FVPTC 3852832 EMR3 NHP_PTC 2458338 ENAH NHP_PTC 3345427ENDOD1 NHP_PTC 2327677 EPB41 NHP_PTC 2600689 EPHA4 NHP_PTC 2346625 EPHX4NHP_PTC 3772187 EPR1 LCT_REST 3445908 EPS8 NHP_PTC 3720402 ERBB2FA_FVPTC 3417249 ERBB3 NHP_PTC 3683845 ERI2 NHP_PTC 2462329 ERO1LBFA_FVPTC 3445768 ERP27 NHP_PTC 2451870 ETNK2 NHP_PTC 3039177 ETV1NHP_PTC 2709132 ETV5 NHP_PTC 2863363 F2RL2 NHP_PTC 3979101 FAAH2 NHP_PTC3142381 FABP4 NHP_PTC NHP_FVPTC 3331926 FAM111A NHP_PTC 3331903 FAM111BNHP_PTC 3104323 FAM164A NHP_PTC 2560625 FAM176A NHP_PTC 3768535 FAM20ANHP_PTC 3143330 FAM82B FA_FVPTC 3152558 FAM84B NHP_PTC 2396750 FBXO2NHP_PTC 3473480 FBXO21 NHP_PTC 3229338 FCN1 NHP_PTC 3229338 FCN2 NHP_PTC2742109 FGF2 NHP_PTC 3413950 FGFR1OP2 NHP_PTC 3324447 FIBIN FA_FVPTC2738244 FLJ20184 NHP_PTC 3346147 FLJ32810 NHP_PTC 3338192 FLJ42258NHP_PTC FA_FVPTC NHP_FVPTC 3380065 FLJ42258 NHP_FVPTC 3898355 FLRT3NHP_PTC 2526806 FN1 NHP_PTC 2598261 FN1 NHP_PTC FA_FVPTC 3869237 FPR1NHP_PTC 3839910 FPR2 NHP_PTC 3212008 FRMD3 NHP_PTC FA_FVPTC NHP_FVPTC3393479 FXYD6 NHP_FVPTC 3343452 FZD4 NHP_PTC 3110272 FZD6 NHP_PTC2523045 FZD7 NHP_PTC 3217242 GABBR2 NHP_PTC 2884845 GABRB2 NHP_PTCFA_FVPTC NHP_FVPTC 2341083 GADD45A NHP_PTC 2401581 GALE NHP_PTC 3181600GALNT12 NHP_PTC 2585129 GALNT3 NHP_PTC 2751936 GALNT7 NHP_PTC 2684187GBE1 NHP_FVPTC 2421843 GBP1 
NHP_PTC 2421843 GBP3 NHP_PTC 3044129 GGCTNHP_PTC 4015763 GLA NHP_FVPTC 3593931 GLDN NHP_PTC 2417272 GNG12 NHP_PTC2451931 GOLT1A NHP_PTC 2955932 GPR110 NHP_PTC 2955999 GPR110 NHP_PTC2819779 GPR98 NHP_PTC NHP_FVPTC 3683377 GPRC5B NHP_PTC 2827057 GRAMD3NHP_PTC 3187686 GSN NHP_PTC 2787958 GYPB NHP_PTC 2504328 GYPC NHP_PTC2787958 GYPE NHP_PTC 2809793 GZMK LCT_REST 3217077 HEMGN NHP_PTC 2924492HEY2 NHP_PTC FA_FVPTC NHP_FVPTC 2946194 HIST1H1A NHP_PTC 2946215HIST1H3B LCT_REST 2947081 HIST1H4L LCT_REST 2950125 HLA-DQB2 NHP_PTC3420316 HMGA2 NHP_PTC 3830065 HPN NHP_PTC 2658275 HRASLS FA_FVPTC3508330 HSPH1 NHP_PTC 3820443 ICAM1 NHP_PTC 2401493 ID3 FA_FVPTC 2708922IGF2BP2 FA_FVPTC 2598828 IGFBP5 NHP_PTC 3415744 IGFBP6 NHP_PTC FA_FVPTCNHP_FVPTC 4021777 IGSF1 NHP_PTC FA_FVPTC 2597867 IKZF2 FA_FVPTCNHP_FVPTC 3755862 IKZF3 FA_FVPTC 2497082 IL1RL1 NHP_PTC 3275729 IL2RANHP_FVPTC 2731332 IL8 NHP_FVPTC 2599303 IL8RA NHP_PTC 2599303 IL8RBNHP_PTC 2527580 IL8RB NHP_PTC 2599303 IL8RBP NHP_PTC 2527580 IL8RBPNHP_PTC 2673873 IMPDH2 NHP_PTC 3267382 INPP5F NHP_PTC 2980449 IPCEF1NHP_PTC 2816298 IQGAP2 NHP_PTC 2809245 ITGA2 NHP_PTC 3726154 ITGA3NHP_PTC 2617188 ITGA9 NHP_PTC FA_FVPTC 3852832 ITGB1 NHP_PTC 2583465ITGB6 NHP_PTC 2991860 ITGB8 NHP_PTC 4013549 ITM2A LCT_REST 2608469 ITPR1NHP_PTC 3556990 JUB NHP_PTC 3998766 KAL1 NHP_PTC 2628260 KBTBD8 LCT_REST2952834 KCNK5 NHP_PTC 3154002 KCNQ3 NHP_PTC 3383130 KCTD14 NHP_PTC2827525 KDELC1 NHP_PTC 3945314 KDELR3 NHP_PTC 2959039 KHDRBS2 NHP_PTC3554452 KIAA0284 NHP_PTC NHP_FVPTC 2973232 KIAA0408 NHP_PTC FA_FVPTC3238962 KIAA1217 NHP_PTC NHP_FVPTC 3529951 KIAA1305 FA_FVPTC 2727587 KITNHP_PTC FA_FVPTC NHP_FVPTC 3978943 KLF8 NHP_PTC 2708066 KLHL6 LCT_REST3868828 KLK10 NHP_PTC 3868783 KLK7 NHP_PTC 3415576 KRT18 NHP_PTC 3757108KRT19 NHP_PTC FA_FVPTC 2453793 LAMB3 NHP_PTC 2371065 LAMC1 NHP_PTC2371139 LAMC2 NHP_PTC 2962026 LCA5 NHP_PTC 3653619 LCMT1 NHP_PTC 3190190LCN2 NHP_PTC 4024420 LDOC1 NHP_PTC 2452478 LEMD1 NHP_PTC 2854092 LIFRFA_FVPTC 3841545 LILRA1 NHP_PTC 
3841545 LILRB1 NHP_PTC 3454331 LIMA1NHP_PTC 3202528 LINGO2 NHP_PTC 2708855 LIPH NHP_PTC FA_FVPTC NHP_FVPTC3446137 LMO3 NHP_PTC FA_FVPTC NHP_FVPTC 2345286 LMO4 NHP_PTC 3028011LOC100124692 NHP_PTC 3442054 LOC100127974 NHP_PTC 3765689 LOC100129112NHP_PTC 3759587 LOC100129115 NHP_PTC 2601414 LOC100129171 NHP_PTC2577482 LOC100129961 NHP_PTC 2504328 LOC100130248 NHP_PTC 3110272LOC100131102 NHP_PTC 3464860 LOC100131490 NHP_PTC 2364677 LOC100131938NHP_PTC 3922793 LOC100132338 NHP_PTC 3392332 LOC100132764 NHP_PTC3487095 LOC283508 NHP_PTC 3724698 LOC440434 NHP_PTC 3201345 LOC554202NHP_PTC 2455418 LOC643454 NHP_PTC 2525533 LOC648149 NHP_PTC 4015838LOC653354 NHP_PTC 3724698 LOC653498 NHP_PTC 2936857 LOC730031 NHP_PTC2567167 LONRF2 NHP_PTC LCT_REST 2872848 LOX NHP_PTC 3220384 LPAR1NHP_FVPTC 3442137 LPAR5 NHP_PTC 3088486 LPL NHP_PTC 2578790 LRP1BNHP_PTC NHP_FVPTC 3106559 LRRC69 NHP_PTC 2608309 LRRN1 NHP_PTC 3465248LUM NHP_PTC 3683845 LYRM1 NHP_PTC 3040518 MACC1 NHP_PTC FA_FVPTC 3451814MAFG NHP_PTC FA_FVPTC 3994710 MAMLD1 NHP_PTC 2525533 MAP2 NHP_PTC3111561 MAPK6 NHP_PTC NHP_FVPTC 3108526 MATN2 FA_FVPTC 2539607 MBOAT2NHP_PTC 3097152 MCM4 NHP_PTC 3063685 MCM7 NHP_PTC 3329343 MDK NHP_PTCFA_FVPTC 2962820 ME1 NHP_FVPTC 3765689 MED13 NHP_PTC 3020343 MET NHP_PTCFA_FVPTC NHP_FVPTC 3416895 METTL7B NHP_PTC FA_FVPTC NHP_FVPTC 3808096MEX3C NHP_PTC 3638204 MFGE8 NHP_PTC FA_FVPTC NHP_FVPTC 3028011 MGAMNHP_PTC 2890859 MGAT1 NHP_FVPTC 3464417 MGAT4C NHP_PTC 2658275 MGC2889FA_FVPTC 3406589 MGST1 NHP_PTC 3707759 MIS12 FA_FVPTC 2936857 MLLT4NHP_PTC 3143660 MMP16 NHP_PTC 3143643 MMP16 NHP_PTC 2362333 MNDA NHP_PTC4017212 MORC4 NHP_PTC 3367673 MPPED2 NHP_PTC 3393720 MPZL2 NHP_PTCFA_FVPTC NHP_FVPTC 2955025 MRPL14 NHP_PTC 3662201 MT1F NHP_PTC FA_FVPTC3692999 MT1G NHP_PTC NHP_FVPTC 3662201 MT1H NHP_PTC FA_FVPTC 3662150MT1M NHP_PTC 3662201 MT1P2 NHP_PTC FA_FVPTC 3662150 MT1P3 NHP_PTC2931391 MTHFD1L NHP_PTC 2437118 MUC1 NHP_PTC 3366903 MUC15 NHP_PTC3655723 MVP NHP_PTC 3997825 MXRA5 NHP_PTC 3622934 MYEF2 
NHP_PTC 3744463MYH10 NHP_PTC 2520429 MYO1B NHP_PTC 3752709 MYO1D NHP_PTC 3624607 MYO5ANHP_FVPTC 2914070 MYO6 NHP_PTC 3417809 NAB2 NHP_PTC 3695268 NAE1 NHP_PTC3074912 NAG20 NHP_PTC 3323052 NAV2 FA_FVPTC NHP_FVPTC 3349293 NCAM1FA_FVPTC 2590736 NCKAP1 NHP_PTC 3495076 NDFIP2 NHP_PTC 3789947 NEDD4LNHP_PTC 3451814 NELL2 NHP_PTC FA_FVPTC 2343231 NEXN NHP_PTC 3456666 NFE2NHP_PTC 3199207 NFIB NHP_PTC 2325410 NIPAL3 NHP_PTC 3182957 NIPSNAP3AFA_FVPTC 3182957 NIPSNAP3B FA_FVPTC 3044072 NOD1 NHP_PTC FA_FVPTC3571904 NPC2 NHP_PTC 3724698 NPEPPS NHP_PTC 2370926 NPL NHP_FVPTC2792127 NPY1R NHP_PTC 3067478 NRCAM NHP_PTC NHP_FVPTC 3925639 NRIP1NHP_PTC 2524301 NRP2 NHP_PTC 2915828 NT5E NHP_PTC 3143330 NTAN1 FA_FVPTC3322251 NUCB2 FA_FVPTC NHP_FVPTC 2742109 NUDT6 NHP_PTC 3654699 NUPR1FA_FVPTC 2768654 OCIAD2 NHP_PTC 2375338 OCR1 NHP_PTC 4020655 ODZ1NHP_PTC FA_FVPTC NHP_FVPTC 3380065 ORAOV1 NHP_FVPTC 3801621 OSBPL1ANHP_FVPTC 3555461 OSGEP NHP_PTC 2807359 OSMR NHP_PTC 2701071 P2RY13NHP_PTC 2875193 P4HA2 NHP_PTC 2822215 PAM NHP_PTC 3256590 PAPSS2NHP_FVPTC 3505781 PARP4 NHP_PTC 3320865 PARVA NHP_PTC 2364677 PBX1NHP_PTC 3134922 PCMTD1 FA_FVPTC 2783596 PDE5A NHP_PTC NHP_FVPTC 3922793PDE9A NHP_PTC 3087703 PDGFRL NHP_PTC 3301218 PDLIM1 NHP_PTC 2828441PDLIM4 NHP_PTC 3411810 PDZRN4 NHP_PTC 3013255 PEG10 NHP_PTC 2976360 PERPNHP_PTC 3971451 PHEX NHP_PTC 3975893 PHF16 NHP_PTC 2635906 PHLDB2NHP_PTC 3127385 PHYHIP NHP_PTC 3811086 PIGN FA_FVPTC 3111561 PKHD1L1NHP_PTC NHP_FVPTC 2511820 PKP4 NHP_PTC 3376529 PLA2G16 NHP_PTC 2955827PLA2G7 NHP_FVPTC 2583374 PLA2R1 NHP_PTC 3136178 PLAG1 NHP_PTC FA_FVPTCNHP_FVPTC 3252036 PLAU NHP_PTC 3759587 PLCD3 NHP_PTC 2521574 PLCL1NHP_FVPTC 3867458 PLEKHA4 NHP_PTC 3407096 PLEKHA5 NHP_PTC 2858023 PLK2NHP_PTC 3987996 PLS3 NHP_PTC 3911217 PMEPA1 NHP_PTC 3061997 PON2 NHP_PTC2763550 PPARGC1A NHP_PTC 2773358 PPBP NHP_PTC 3678462 PPL NHP_PTC2931090 PPP1R14C NHP_PTC 3384270 PRCP NHP_FVPTC 3451375 PRICKLE1 NHP_PTC3238962 PRINS NHP_PTC NHP_FVPTC 2682271 PROK2 NHP_PTC 2685304 
PROS1NHP_PTC 2994981 PRR15 NHP_PTC 3973692 PRRG1 NHP_PTC 3343452 PRSS23NHP_PTC 3175971 PSAT1 FA_FVPTC 3126368 PSD3 NHP_PTC 3126191 PSD3 NHP_PTC2455418 PTPN14 NHP_PTC 2333318 PTPRF NHP_PTC 2626802 PTPRG NHP_PTC2973376 PTPRK NHP_PTC 3757917 PTRF NHP_PTC 3134922 PXDNL FA_FVPTC3638204 QTRT1 NHP_PTC FA_FVPTC NHP_FVPTC 2361257 RAB25 NHP_PTC 3625271RAB27A NHP_PTC 2929699 RAB32 NHP_FVPTC 3751002 RAB34 NHP_PTC 3183757RAD23B NHP_PTC 3369931 RAG2 NHP_PTC FA_FVPTC 4001223 RAI2 NHP_PTC3040967 RAPGEF5 NHP_PTC 3456081 RARG NHP_PTC 2819044 RASA1 NHP_PTC3944210 RASD2 NHP_PTC 4000944 RBBP7 NHP_PTC 3781429 RBBP8 NHP_PTC3417583 RBMS2 NHP_PTC 3336486 RCE1 NHP_PTC 3416921 RDH5 NHP_PTC 2779335RG9MTD2 FA_FVPTC 2372812 RGS13 LCT_REST 2372719 RGS18 NHP_PTC 2372858RGS2 NHP_PTC 2384401 RHOU NHP_PTC 2580802 RND3 NHP_PTC 2721959 ROS1NHP_PTC FA_FVPTC NHP_FVPTC 2709606 RPL39L NHP_PTC 3804143 RPRD1A NHP_PTC3867965 RRAS NHP_PTC 2469252 RRM2 LCT_REST 2442008 RXRG NHP_PTC FA_FVPTCNHP_FVPTC 2435981 S100A12 NHP_PTC 4045665 S100A14 NHP_PTC 4045643S100A16 NHP_PTC 2435989 S100A8 NHP_PTC 2359664 S100A9 NHP_PTC 3691326SALL1 NHP_PTC NHP_FVPTC 3564027 SAV1 NHP_PTC 2750594 SC4MOL NHP_PTC3091475 SCARA3 NHP_PTC 3442054 SCARNA11 NHP_PTC 3494629 SCEL NHP_PTC3587495 SCG5 NHP_PTC NHP_FVPTC 3441885 SCNN1A NHP_PTC 3043895 SCRN1NHP_PTC 3907234 SDC4 NHP_PTC FA_FVPTC NHP_FVPTC 3779756 SEH1L NHP_PTC2443450 SELL NHP_PTC 3058759 SEMA3C NHP_FVPTC 3059667 SEMA3D NHP_PTC2732273 SEPT11 NHP_PTC 2328273 SERINC2 NHP_PTC 3577612 SERPINA1 NHP_PTCFA_FVPTC 3577612 SERPINA2 NHP_PTC FA_FVPTC 2601414 SERPINE2 NHP_PTC3331355 SERPING1 NHP_PTC 2326774 SFN NHP_PTC 2562435 SFTPB NHP_PTC2768981 SGCB NHP_PTC 3061805 SGCE NHP_PTC 2648535 SGEF NHP_PTC 2738664SGMS2 NHP_PTC 3088213 SH2D4A NHP_PTC 3304970 SH3PXD2A NHP_PTC 3894727SIRPA NHP_PTC 3894727 SIRPB1 NHP_PTC 3154263 SLA NHP_PTC 2827525 SLC12A2NHP_PTC 2427469 SLC16A4 NHP_PTC 3768412 SLC16A6 NHP_FVPTC 2960955SLC17A5 NHP_PTC 3622934 SLC24A5 NHP_PTC 3018605 SLC26A4 NHP_PTC 3106559SLC26A7 NHP_PTC 
3593575 SLC27A2 NHP_PTC 2827645 SLC27A6 NHP_PTC 2721959SLC34A2 NHP_PTC FA_FVPTC NHP_FVPTC 3216276 SLC35D2 FA_FVPTC 3389976SLC35F2 NHP_PTC 3804195 SLC39A6 NHP_PTC 2730746 SLC4A4 NHP_PTC 3467949SLC5A8 NHP_PTC 2786322 SLC7A11 FA_FVPTC NHP_FVPTC 3087659 SLC7A2 NHP_PTC2720584 SLIT2 NHP_PTC 3907190 SLPI NHP_PTC 3509842 SMAD9 FA_FVPTC2937144 SMOC2 NHP_PTC 3766960 SMURF2 NHP_PTC 2777714 SNCA NHP_PTC3597857 SNX1 NHP_PTC 3597914 SNX22 NHP_PTC 2348437 SNX7 NHP_PTC 2369557SOAT1 NHP_FVPTC 2797202 SORBS2 NHP_PTC 3413950 SPATS2 NHP_PTC 2522094SPATS2L NHP_PTC 2585933 SPC25 LCT_REST 3590164 SPINT1 NHP_PTC 2556752SPRED2 NHP_PTC 2742224 SPRY1 NHP_PTC FA_FVPTC 3519309 SPRY2 NHP_PTC3677969 SRL NHP_PTC 3408831 SSPN NHP_PTC 2562529 ST3GAL5 NHP_PTC 3011861STEAP2 FA_FVPTC 2834282 STK32A NHP_PTC NHP_FVPTC 3558418 STXBP6NHP_FVPTC 3102372 SULF1 NHP_PTC 2979871 SYNE1 NHP_PTC 2378256 SYT14NHP_PTC 3973891 SYTL5 NHP_PTC 2414958 TACSTD2 NHP_PTC 3898126 TASP1FA_FVPTC 3724698 TBC1D3F NHP_PTC 3264621 TCF7L2 FA_FVPTC 3913483 TCFL5FA_FVPTC 2435218 TDRKH NHP_PTC 3320944 TEAD1 NHP_PTC 3321055 TEAD1NHP_PTC 2573570 TFCP2L1 NHP_PTC 3933536 TFF3 NHP_PTC 2591421 TFPINHP_FVPTC 2558612 TGFA NHP_PTC 2380590 TGFB2 NHP_PTC 3181728 TGFBR1NHP_PTC 3976341 TIMP1 NHP_PTC FA_FVPTC 2649113 TIPARP NHP_PTC 3615579TJP1 NHP_PTC 3173880 TJP2 NHP_PTC 3751042 TLCD1 NHP_PTC 3969115 TLR8NHP_PTC 2700365 TM4SF1 NHP_PTC 2647315 TM4SF4 NHP_PTC 3110608 TM7SF4NHP_PTC 3763390 TMEM100 NHP_PTC 3412345 TMEM117 NHP_PTC 3346147 TMEM133NHP_PTC 2577482 TMEM163 NHP_PTC 2815220 TMEM171 FA_FVPTC NHP_FVPTC3166644 TMEM215 NHP_PTC NHP_FVPTC 3571904 TMEM90A NHP_PTC 3717870 TMEM98NHP_PTC 3351200 TMPRSS4 NHP_PTC 3222170 TNC NHP_PTC 3150455 TNFRSF11BFA_FVPTC 3645555 TNFRSF12A NHP_PTC FA_FVPTC NHP_FVPTC 3648391 TNFRSF17LCT_REST 3222128 TNFSF15 NHP_PTC 3907111 TOMM34 NHP_PTC 3136888 TOXLCT_REST 2924330 TPD52L1 NHP_PTC 2466554 TPO NHP_PTC FA_FVPTC NHP_FVPTC3818515 TRIP10 NHP_PTC 4018327 TRPC5 NHP_PTC FA_FVPTC NHP_FVPTC 3512294TSC22D1 NHP_PTC 2991150 
TSPAN13 NHP_PTC 4015397 TSPAN6 NHP_PTC 3891342TUBB1 NHP_PTC 3779579 TUBB6 NHP_PTC 3401217 TULP3 NHP_PTC 3087167 TUSC3NHP_PTC 3809324 TXNL1 FA_FVPTC 3429460 TXNRD1 NHP_FVPTC 3775842 TYMSNHP_PTC 2448971 UCHL5 NHP_FVPTC 2974592 VNN1 NHP_FVPTC 2974635 VNN2NHP_PTC 2974610 VNN3 NHP_PTC 3203855 WDR40A NHP_PTC 2489228 WDR54NHP_PTC 3625052 WDR72 FA_FVPTC 3768474 WIPI1 NHP_FVPTC 2677356 WNT5ANHP_PTC 4015548 XKRX NHP_PTC 2370123 XPR1 NHP_PTC 3832280 YIF1B NHP_PTC2413484 YIPF1 FA_FVPTC 4024373 YTHDC2 NHP_PTC 3989089 ZBTB33 NHP_PTC3988596 ZCCHC12 NHP_PTC FA_FVPTC NHP_FVPTC 3987607 ZCCHC16 NHP_PTCFA_FVPTC NHP_FVPTC 3569754 ZFP36L1 NHP_PTC 2706791 ZMAT3 NHP_PTC 3132616ZMAT4 NHP_PTC FA_FVPTC NHP_FVPTC 2331903 ZNF643 NHP_PTC 3011675 ZNF804BNHP_PTC

TABLE 5 Trident Analysis This benign vs. malignant analysis resulted in210 unique TCIDs, currently mapping to 237 genes. These genes representthe union of three statistically significant sub-analyses (Repeatable,Bayes, and Tissue) using a single dataset. Gene Symbol TCID (Affyv.na29) Repeatable Bayes Tissue DE P value 3393720 MPZL2 TRUE TRUE TRUE1.49 1.87E−32 2400177 CAMK2N1 TRUE TRUE FALSE 1.67 2.27E−29 3067478NRCAM TRUE TRUE FALSE 1.42 2.53E−29 3445908 EPS8 TRUE TRUE TRUE 1.446.34E−29 3020343 MET TRUE TRUE FALSE 1.49 1.47E−27 4012178 CITED1 TRUETRUE FALSE 1.50 2.37E−27 2710599 CLDN1 TRUE TRUE FALSE 1.41 9.07E−273338192 CCND1 TRUE TRUE TRUE 1.36 2.63E−26 3338192 FLJ42258 TRUE TRUETRUE 1.36 2.63E−26 3126191 PSD3 TRUE TRUE TRUE 1.32 3.49E−25 2884845GABRB2 TRUE TRUE FALSE 1.73 4.07E−25 3087167 TUSC3 TRUE TRUE FALSE 1.496.22E−25 3907234 SDC4 TRUE TRUE FALSE 1.46 2.08E−24 2721959 ROS1 TRUETRUE FALSE 1.48 2.82E−24 2721959 SLC34A2 TRUE TRUE FALSE 1.48 2.82E−243679959 EMP2 FALSE TRUE FALSE 1.50 2.83E−24 2708855 C11orf72 TRUE TRUETRUE 1.59 1.31E−23 2708855 LIPH TRUE TRUE TRUE 1.59 1.31E−23 3416895METTL7B TRUE TRUE FALSE 1.49 2.12E−23 3136178 PLAG1 FALSE TRUE TRUE 1.412.37E−23 2442008 RXRG TRUE TRUE FALSE 1.60 3.50E−23 2657808 CLDN16 TRUETRUE TRUE 1.51 2.63E−22 3984945 ARMCX3 TRUE TRUE FALSE 1.45 3.13E−222567167 LONRF2 TRUE TRUE TRUE 1.38 3.67E−22 2685304 PROS1 TRUE TRUEFALSE 1.46 3.81E−22 3744463 MYH10 TRUE TRUE FALSE 1.46 6.20E−22 3415744IGFBP6 TRUE TRUE FALSE 1.56 9.91E−22 2834282 STK32A TRUE TRUE TRUE 1.271.03E−21 3554452 KIAA0284 TRUE TRUE FALSE 1.32 1.38E−21 3040518 MACC1TRUE TRUE FALSE 1.47 1.42E−21 3587495 SCG5 TRUE TRUE FALSE 1.34 1.74E−212686023 DCBLD2 TRUE TRUE FALSE 1.18 1.83E−21 3335894 CST6 FALSE TRUEFALSE 1.43 2.29E−21 2783596 PDE5A TRUE TRUE TRUE 1.55 2.63E−21 3522398AIDA TRUE TRUE FALSE 1.38 2.99E−21 3522398 DOCK9 TRUE TRUE FALSE 1.382.99E−21 3638204 MFGE8 TRUE TRUE FALSE 1.51 5.35E−21 3638204 QTRT1 TRUETRUE FALSE 1.51 5.35E−21 3323052 NAV2 TRUE TRUE FALSE 
1.30 7.00E−212924492 HEY2 TRUE TRUE FALSE 1.48 2.01E−20 3726154 ITGA3 TRUE TRUE FALSE1.35 2.16E−20 2924330 TPD52L1 TRUE TRUE FALSE 1.17 2.21E−20 3988596ZCCHC12 TRUE TRUE FALSE 1.52 2.85E−20 3683377 GPRC5B TRUE TRUE FALSE1.28 4.84E−20 3417249 ERBB3 FALSE TRUE FALSE 1.51 6.63E−20 2511820 PKP4TRUE TRUE TRUE 1.22 7.51E−20 4020655 ODZ1 TRUE TRUE FALSE 1.34 8.32E−203628832 DAPK2 FALSE TRUE FALSE 1.34 1.20E−19 3007960 CLDN4 TRUE TRUEFALSE 1.20 1.42E−19 2598261 FN1 TRUE TRUE FALSE 1.31 3.25E−19 2936857LOC730031 TRUE TRUE TRUE 1.12 5.16E−19 2936857 MLLT4 TRUE TRUE TRUE 1.125.16E−19 3666366 CDH3 TRUE TRUE TRUE 1.48 6.10E−19 3757108 KRT19 TRUETRUE FALSE 1.41 6.20E−19 3451375 PRICKLE1 FALSE TRUE TRUE 1.42 8.79E−193338552 CTTN TRUE TRUE TRUE 1.10 9.53E−19 2680046 ADAMTS9 TRUE TRUEFALSE 1.40 1.06E−18 3867458 PLEKHA4 TRUE TRUE FALSE 1.35 1.50E−183494629 SCEL TRUE TRUE FALSE 1.39 1.57E−18 3978943 KLF8 TRUE TRUE FALSE1.35 3.66E−18 2397025 DHRS3 TRUE TRUE FALSE 1.22 3.89E−18 3420316 HMGA2TRUE TRUE TRUE 1.48 4.63E−18 3126368 PSD3 TRUE TRUE FALSE 1.19 5.77E−182809245 ITGA2 TRUE TRUE FALSE 1.45 6.16E−18 2526806 FN1 TRUE TRUE TRUE1.19 7.55E−18 2827645 SLC27A6 FALSE TRUE FALSE 1.49 8.33E−18 3217361ANKS6 TRUE TRUE FALSE 1.19 8.37E−18 3743551 CLDN7 TRUE TRUE FALSE 1.071.80E−17 3571904 NPC2 FALSE TRUE FALSE 0.99 2.53E−17 3571904 TMEM90AFALSE TRUE FALSE 0.99 2.53E−17 2558612 TGFA TRUE TRUE FALSE 1.352.71E−17 3987607 CCDC121 TRUE TRUE FALSE 1.46 3.28E−17 3987607 ZCCHC16TRUE TRUE FALSE 1.46 3.28E−17 3088213 SH2D4A TRUE TRUE FALSE 1.185.07E−17 3751002 RAB34 TRUE TRUE FALSE 1.19 5.77E−17 3973891 CXorf27TRUE TRUE FALSE 1.52 6.03E−17 3973891 SYTL5 TRUE TRUE FALSE 1.526.03E−17 3044072 NOD1 TRUE TRUE TRUE 1.45 6.85E−17 2370123 XPR1 TRUETRUE FALSE 1.26 7.13E−17 3174816 ANXA1 FALSE TRUE TRUE 1.08 7.85E−172966193 C6orf168 TRUE TRUE FALSE 1.37 1.01E−16 2525533 LOC648149 TRUETRUE FALSE 1.24 1.02E−16 2525533 MAP2 TRUE TRUE FALSE 1.24 1.02E−163154002 KCNQ3 TRUE TRUE FALSE 1.41 1.09E−16 3590164 SPINT1 TRUE 
TRUEFALSE 1.17 1.35E−16 3329343 MDK TRUE TRUE TRUE 1.28 1.58E−16 2875193P4HA2 TRUE TRUE FALSE 1.10 1.80E−16 3726691 ABCC3 TRUE TRUE FALSE 1.171.86E−16 2451870 ETNK2 TRUE TRUE TRUE 1.33 1.91E−16 4018327 TRPC5 TRUETRUE TRUE 1.48 2.43E−16 3046197 ELMO1 TRUE TRUE TRUE −1.26 2.80E−162460817 SIPA1L2 TRUE TRUE TRUE 1.17 3.16E−16 3976341 TIMP1 TRUE TRUETRUE 1.15 3.39E−16 2973232 C6orf174 TRUE TRUE FALSE 1.42 3.78E−162973232 KIAA0408 TRUE TRUE FALSE 1.42 3.78E−16 3417809 NAB2 TRUE TRUEFALSE 1.25 5.50E−16 2751936 GALNT7 TRUE TRUE FALSE 1.17 5.95E−16 2648535SGEF TRUE FALSE FALSE 1.16 1.33E−15 3759587 LOC100129115 TRUE TRUE FALSE1.34 1.47E−15 3759587 PLCD3 TRUE TRUE FALSE 1.34 1.47E−15 3994710 MAMLD1FALSE TRUE FALSE 1.37 1.80E−15 3581221 AHNAK2 TRUE TRUE FALSE 1.312.29E−15 3259253 C10orf131 FALSE TRUE TRUE 1.01 4.17E−15 3259253 ENTPD1FALSE TRUE TRUE 1.01 4.17E−15 2562435 EDNRB FALSE TRUE FALSE 1.375.28E−15 2562435 SFTPB FALSE TRUE FALSE 1.37 5.28E−15 3489138 CYSLTR2TRUE TRUE TRUE 1.30 5.69E−15 3002640 EGFR TRUE TRUE TRUE 1.11 8.20E−152578790 LRP1B FALSE TRUE FALSE −0.95 1.06E−14 3768535 FAM20A FALSE TRUEFALSE 1.25 1.11E−14 3044129 GGCT TRUE TRUE FALSE 1.11 1.12E−14 2980449IPCEF1 TRUE TRUE TRUE −1.14 1.29E−14 4018454 AMOT TRUE TRUE FALSE 1.341.47E−14 3763390 TMEM100 TRUE TRUE TRUE 1.40 2.44E−14 2740067 ANK2 FALSETRUE TRUE −0.89 2.57E−14 3622934 MYEF2 TRUE TRUE TRUE 1.03 4.13E−143622934 SLC24A5 TRUE TRUE TRUE 1.03 4.13E−14 2414958 TACSTD2 FALSE TRUEFALSE 1.29 5.50E−14 3321150 ARNTL TRUE TRUE TRUE 1.18 7.68E−14 3464860DUSP6 TRUE TRUE FALSE 1.10 1.17E−13 3464860 LOC100131490 TRUE TRUE FALSE1.10 1.17E−13 3217242 GABBR2 TRUE TRUE TRUE 1.21 1.22E−13 3110608 TM7SF4TRUE TRUE TRUE 1.23 2.16E−13 3110395 RIMS2 TRUE TRUE FALSE 1.13 2.54E−133649714 C16orf45 TRUE TRUE FALSE 1.10 7.74E−13 3867264 CA11 TRUE TRUEFALSE 1.05 8.23E−13 3832280 C19orf33 TRUE TRUE FALSE 1.20 8.77E−133832280 YIF1B TRUE TRUE FALSE 1.20 8.77E−13 2452440 KLHDC8A TRUE TRUEFALSE 1.08 1.39E−12 2608469 ITPR1 TRUE TRUE TRUE 
−1.10 1.71E−12 3577612SERPINA1 FALSE TRUE FALSE 0.96 2.24E−12 3577612 SERPINA2 FALSE TRUEFALSE 0.96 2.24E−12 4015548 XKRX TRUE TRUE FALSE 1.12 2.68E−12 3451814MAFG FALSE TRUE TRUE 1.04 2.91E−12 3451814 NELL2 FALSE TRUE TRUE 1.042.91E−12 2734421 ARHGAP24 FALSE TRUE FALSE −1.05 3.17E−12 2816298 IQGAP2TRUE TRUE FALSE −1.10 5.75E−12 2524301 NRP2 FALSE TRUE FALSE 0.937.41E−12 3132616 ZMAT4 FALSE TRUE TRUE −0.89 1.03E−11 3365136 SERGEFFALSE TRUE TRUE 0.98 1.04E−11 3367673 MPPED2 FALSE TRUE FALSE −0.951.18E−11 2608309 LRRN1 FALSE FALSE TRUE 0.84 1.66E−11 2820925 RHOBTB3FALSE TRUE TRUE 0.85 2.73E−11 3369931 RAG2 FALSE TRUE TRUE −0.753.90E−11 2708922 IGF2BP2 FALSE TRUE TRUE 0.90 5.15E−11 3868783 KLK7 TRUETRUE TRUE 1.19 7.94E−11 3006572 AUTS2 TRUE TRUE FALSE 1.06 1.02E−103411810 PDZRN4 TRUE TRUE FALSE 1.20 1.21E−10 2876897 SPOCK1 TRUE FALSEFALSE 1.05 1.39E−10 3166644 TMEM215 FALSE FALSE TRUE 0.98 1.49E−103933536 TFF3 FALSE TRUE FALSE −0.80 2.50E−10 3159330 DOCK8 FALSE TRUETRUE −0.90 2.53E−10 3279058 ACBD7 FALSE TRUE TRUE 1.03 2.83E−10 3593931GLDN TRUE TRUE FALSE 1.13 3.46E−10 3404030 KLRG1 FALSE TRUE TRUE −0.885.39E−10 2373842 PTPRC FALSE FALSE TRUE −0.90 9.75E−10 3010503 CD36FALSE TRUE TRUE −0.81 3.46E−09 2583374 PLA2R1 FALSE TRUE TRUE −0.726.14E−09 3856646 ZNF208 FALSE FALSE TRUE 0.77 6.91E−09 3692999 MT1GFALSE TRUE TRUE −0.82 1.01E−08 2587790 GPR155 FALSE TRUE FALSE −0.861.12E−08 2362351 PYHIN1 FALSE FALSE TRUE −0.76 1.46E−08 2727587 KITFALSE TRUE FALSE −0.75 1.50E−08 2427619 KCNA3 FALSE FALSE TRUE −0.781.50E−08 3142381 FABP4 FALSE TRUE FALSE −0.72 1.82E−08 2584018 DPP4FALSE TRUE TRUE 0.78 2.22E−08 2387126 RYR2 FALSE TRUE TRUE −0.642.26E−08 2823880 CAMK4 FALSE FALSE TRUE −0.72 2.67E−08 3410384 C12orf35FALSE FALSE TRUE −0.78 2.74E−08 2466554 TPO FALSE TRUE FALSE −0.775.30E−08 2806468 IL7R FALSE FALSE TRUE −0.78 1.04E−07 2730746 SLC4A4FALSE TRUE TRUE −0.73 1.12E−07 3467949 SLC5A8 FALSE FALSE TRUE −0.741.23E−07 2518272 CERKL FALSE FALSE TRUE −0.74 1.58E−07 2518272 ITGA4FALSE 
FALSE TRUE −0.74 1.58E−07 3450861 ABCD2 FALSE FALSE TRUE −0.661.63E−07 3389450 CARD16 FALSE FALSE TRUE −0.78 1.66E−07 3389450 CASP1FALSE FALSE TRUE −0.78 1.66E−07 2657831 IL1RAP FALSE TRUE FALSE 0.781.85E−07 3059667 SEMA3D FALSE TRUE TRUE −0.71 2.04E−07 4013460 CYSLTR1FALSE FALSE TRUE −0.71 2.12E−07 3126504 CSGALNACT1 FALSE TRUE TRUE −0.652.29E−07 3811339 BCL2 FALSE TRUE TRUE −0.76 2.29E−07 2724671 RHOH FALSEFALSE TRUE −0.69 2.37E−07 3160895 JAK2 FALSE FALSE TRUE −0.74 2.48E−072486811 PLEK FALSE FALSE TRUE −0.75 2.66E−07 3443804 KLRB1 FALSE FALSETRUE −0.73 2.84E−07 3576704 TC2N FALSE TRUE TRUE −0.74 3.29E−07 3742627C17orf87 FALSE FALSE TRUE −0.70 4.80E−07 3347658 ATM FALSE FALSE TRUE−0.65 4.89E−07 3347658 NPAT FALSE FALSE TRUE −0.65 4.89E−07 2815220TMEM171 FALSE FALSE TRUE −0.60 5.00E−07 3960174 LGALS2 FALSE FALSE TRUE−0.70 5.58E−07 2462329 ERO1LB FALSE TRUE TRUE −0.67 6.74E−07 2608725BHLHE40 FALSE TRUE TRUE 0.72 8.08E−07 3389353 CARD17 FALSE FALSE TRUE−0.72 1.09E−06 3389353 CASP1 FALSE FALSE TRUE −0.72 1.09E−06 3062082PDK4 FALSE FALSE TRUE 0.67 1.22E−06 2593159 STK17B FALSE FALSE TRUE−0.65 1.88E−06 2353669 CD2 FALSE FALSE TRUE −0.67 2.06E−06 2428796PTPN22 FALSE FALSE TRUE −0.66 2.70E−06 2422035 GBP5 FALSE FALSE TRUE−0.69 3.37E−06 2766289 TMEM156 FALSE FALSE TRUE −0.57 4.55E−06 3060450C7orf62 FALSE FALSE TRUE −0.61 5.81E−06 2439554 AIM2 FALSE FALSE TRUE−0.60 6.78E−06 3443891 CLEC2B FALSE FALSE TRUE −0.58 3.51E−05 2766192TLR10 FALSE FALSE TRUE −0.51 3.87E−05 3536706 LGALS3 FALSE TRUE FALSE0.52 4.67E−05 3009838 CCDC146 FALSE FALSE TRUE −0.56 7.30E−05 3009838POLR2J4 FALSE FALSE TRUE −0.56 7.30E−05 2412312 TTC39A FALSE FALSE TRUE0.51 7.45E−05 2548699 CYP1B1 FALSE TRUE FALSE 0.49 3.52E−04 3443868 CD69FALSE FALSE TRUE −0.47 4.85E−04 3461981 TSPAN8 FALSE FALSE TRUE −0.447.33E−04 3648391 TNFRSF17 FALSE FALSE TRUE −0.44 7.66E−04 3018605SLC26A4 FALSE TRUE TRUE −0.46 9.81E−04 3107828 PLEKHF2 FALSE FALSE TRUE−0.42 1.19E−03 2372812 RGS13 FALSE FALSE TRUE −0.38 1.66E−03 
3197955GLDC FALSE FALSE TRUE −0.37 5.51E−03 2796995 SORBS2 FALSE FALSE TRUE−0.32 1.01E−02 3135567 LYPLA1 FALSE FALSE TRUE −0.32 1.78E−02 2732508CXCL13 FALSE FALSE TRUE −0.30 1.94E−02 3200982 MLLT3 FALSE FALSE TRUE−0.30 2.03E−02 2735027 SPP1 FALSE FALSE TRUE 0.25 6.47E−02 2554018EFEMP1 FALSE FALSE TRUE −0.20 1.55E−01 2945882 CMAH FALSE FALSE TRUE−0.21 1.65E−01 2767378 ATP8A1 FALSE FALSE TRUE 0.20 1.79E−01 4016193TMSB15A FALSE FALSE TRUE −0.16 2.27E−01 4016193 TMSB15B FALSE FALSE TRUE−0.16 2.27E−01 3019158 LRRN3 FALSE FALSE TRUE 0.16 2.57E−01 2700244 CPFALSE FALSE TRUE 0.12 4.37E−01 2700244 HPS3 FALSE FALSE TRUE 0.124.37E−01 2855285 CCDC152 FALSE FALSE TRUE −0.10 4.49E−01 2855285 SEPP1FALSE FALSE TRUE −0.10 4.49E−01 2773947 CXCL9 FALSE FALSE TRUE −0.104.56E−01 3108226 PGCP FALSE TRUE TRUE 0.04 7.65E−01 2773972 CXCL11 FALSEFALSE TRUE 0.02 8.93E−01

Algorithms for Disease Diagnostics

The goal of algorithm development is to extract biological information from high-dimensional transcription data in order to accurately classify benign vs. malignant biopsies. In some embodiments, disclosed herein is a molecular classifier algorithm, which combines upfront pre-processing of exon data, followed by exploratory analysis, technical factor removal (when necessary), feature (i.e., marker or gene) selection, classification, and finally, performance measurements. Other embodiments also describe the process of cross-validation within and outside the feature selection loop, as well as an iterative gene selection algorithm that combines three analytical methods of marker extraction (tissue, Bayesian, repeatability).

The nature of the training set represents the biggest obstacle to the creation of a robust classifier. Often, a given clinical cohort is limited by the prevalence of the disease subtypes that are represented, and/or there is no clear phenotypic, biological, and/or molecular distinction between disease subtypes. Theoretically, the training set can be improved by increasing the number and nature of samples in a prospective clinical cohort; however, this approach is not always feasible. Joining multiple datasets to increase the overall size of the cohort used for training the classifier can be accompanied by analytical challenges and experimental bias.

In one aspect, the present invention overcomes limitations in the current art by joining multiple datasets and applying a technical factor removal normalization approach to the datasets prior to and/or during classification. In another aspect, the present invention provides methods for gene selection. In another aspect, the present invention introduces a novel ROC-based method for obtaining more accurate subtype-specific classification error rates. Multiple datasets belonging to distinct experiments can be combined and analyzed together. This increases the number of samples available for model training and the overall accuracy of the predictive algorithm.

1. Quality Control

Affymetrix Power Tools (APT, version 1.10.2) software can be used to process, normalize and summarize output (post-hybridization) microarray data (.CEL) files. Quantile normalization, detection above background (DABG), and robust multichip average (RMA) determination of AUC can be done using APT, a program that has been written and streamlined for the automatic processing of post-hybridization data. This automated processing script produces a probeset-level intensity matrix and a gene-level intensity matrix. DABG can be computed as the fraction of probes having a p-value smaller than 1e-4 when compared with background probes of similar GC content (Affymetrix). Accurate classification may be encumbered by a variety of technical factors, including failed or suboptimal hybridization. Post-hybridization QC metrics can be correlated with pre-hybridization QC variables to identify the technical factors that may obscure or bias signal intensity.

i. Classification Version

In some embodiments, Classification is used to analyze data in the subject method. The version of the engine used to generate reports at the end of discovery has been tagged as Release-Classification-1.0 in SVN. In one example, the data analyzed by Classification are .CEL files generated from thyroid tissue and FNA samples run on the Affymetrix Human Exon 1 ST array with NA26 annotations.

In one example, the overall workflow after scanning the microarrays is as follows: output .CEL files → APT Intensity Matrix → pDABG/AUC → remove samples with AUC≤0.73 and DABG≤ → Plot PCAs of each categorical technical factor → Plot variance component as function of each technical factor → Determine if additional samples need to be removed or flagged → Determine if factor needs removal globally or within cross-validation → run classification.

Two common QC metrics used to pass or fail a sample before it enters the molecular classifier are intron/exon separation AUC and pDABG (or pDET). The threshold for AUC can be, for example, around 0.73. The threshold for pDET (the percentage of genes or probe sets that are detected above background) can be adjusted for different data sets as learning continues during marker discovery.
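The two QC metrics described above can be illustrated with a minimal sketch. It assumes a probes-by-samples matrix of DABG p-values and a per-sample intron/exon AUC vector are already available; the function names pdabg and qc_pass, and the default pDET threshold of 0.2, are hypothetical and are not taken from APT or from the engine described here.

```python
import numpy as np

def pdabg(p_values, alpha=1e-4):
    """Per-sample fraction of probes detected above background.

    p_values: array of shape (n_probes, n_samples) holding DABG p-values.
    alpha: detection cutoff (1e-4 follows the text above).
    """
    p_values = np.asarray(p_values, dtype=float)
    return (p_values < alpha).mean(axis=0)

def qc_pass(auc, pdet, auc_threshold=0.73, pdet_threshold=0.2):
    """Boolean mask of samples that pass both QC metrics.

    Samples with AUC <= auc_threshold or pDET <= pdet_threshold fail.
    pdet_threshold=0.2 is a placeholder value, not a documented cutoff.
    """
    auc = np.asarray(auc, dtype=float)
    pdet = np.asarray(pdet, dtype=float)
    return (auc > auc_threshold) & (pdet > pdet_threshold)
```

In practice the pDET cutoff would be tuned per data set, as noted above, so it is exposed as a parameter rather than fixed.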

2. Exploratory Analysis

In some embodiments, the present invention utilizes one or more exploratory methods to generate a broad preliminary analysis of the data. These methods are used to assess whether technical factors exist in the datasets that may bias downstream analyses. The output from exploratory analyses can be used to flag any suspicious samples or batch effects. Flagged samples or subsets of samples can then be processed for technical factor removal prior to and/or during feature selection and classification. Technical factor removal is described in detail in section 3. The methods used for exploratory analyses include but are not limited to:

Principal component analysis (PCA) can be used to assess the effects of various technical factors, such as laboratory processing batches or FNA sample collection media, on the intensity values. To assess the effects of technical factors, the projection of the normalized intensity values onto the first few principal components can be visualized in a pair-wise manner, color coded by the values of the technical variable. If a significant number of samples are affected by any given technical factor and the first few principal components show separation according to the factor, this factor can be considered a candidate for computational removal during subsequent phases of analysis.
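A minimal sketch of this PCA check follows, assuming a samples-by-genes matrix of normalized intensities. The helper names pca_scores and batch_separation are hypothetical, and the between-batch sum-of-squares ratio is one simple numerical stand-in for the visual, color-coded inspection described above; it is not the method of the disclosure.

```python
import numpy as np

def pca_scores(intensity, n_components=3):
    """Project samples onto the leading principal components.

    intensity: array of shape (n_samples, n_genes), normalized intensities.
    Returns (scores, explained_variance_ratio) for the first n_components.
    """
    x = np.asarray(intensity, dtype=float)
    centered = x - x.mean(axis=0)  # center each gene
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    var_ratio = s ** 2 / np.sum(s ** 2)
    return scores, var_ratio[:n_components]

def batch_separation(component_scores, batch):
    """Fraction of one component's variation explained by a categorical factor.

    Computed as between-group sum of squares divided by total sum of squares.
    Assumes the component is not constant (total sum of squares > 0).
    """
    z = np.asarray(component_scores, dtype=float)
    batch = np.asarray(batch)
    grand = z.mean()
    total = np.sum((z - grand) ** 2)
    between = sum(
        np.sum(batch == b) * (z[batch == b].mean() - grand) ** 2
        for b in np.unique(batch)
    )
    return between / total
```

A value near 1 for an early component would suggest that the component separates samples by that technical factor, which would make the factor a candidate for removal.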

In addition to PCA visualization, the present invention can utilize analysis of variance (variance components) as a quantitative measure to isolate technical factors that have a significant effect on normalized intensity values. Variance decomposition can be achieved by fitting a linear model to the normalized intensity values for each of the genes that passes non-specific filtering criteria. The explanatory variables in the linear model include biological factors as well as technical factors of interest. When categorical technical factors are represented sparsely in the data, combinations of these factors can be explored as explanatory variables in the model to reduce the number of parameters and enable estimation of effect sizes due to individual variables. In one embodiment, once the linear model is fitted for gene n, the omega squared measure (ω²) is used to provide an unbiased assessment of the effect size for each of the explanatory variables j on the individual gene (Bapat, R. B. (2000). Linear Algebra and Linear Models (Second ed.). Springer):

$$\omega^2 = \frac{SS_{\text{effect}} - df_{\text{effect}} \cdot MS_{\text{error}}}{MS_{\text{error}} + SS_{\text{total}}}$$

Here SS_(effect) is the sum of squares due to the explanatory variable, MS_(error) is the mean squared error, SS_(total) is the total sum of squares, and df_(effect) is the degrees of freedom associated with a particular variable. To assess average effect size across all genes passing non-specific filtering criteria, average values of ω² are calculated across all genes and visualized either as raw effect sizes or proportions of total variance explained. FIG. 23 shows an example plot with average effect sizes assessed across one biological factor (pathology class) and three technical factors (one continuous and the others categorical). Non-biological explanatory variables with effect sizes greater than or comparable to the biological factors are considered candidates for computational removal as technical factors.
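By way of illustration only, the ω² calculation can be worked through for a single gene and a single categorical factor. The sketch below is a simplification: the description above fits a linear model with several biological and technical factors, whereas this example assumes a one-way layout so that the sums of squares can be computed directly. The function name `omega_squared` and the toy intensity values are hypothetical.

```python
def omega_squared(values, groups):
    """Omega squared for one categorical factor in a one-way layout."""
    n = len(values)
    grand_mean = sum(values) / n
    levels = sorted(set(groups))
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_effect = 0.0
    for level in levels:
        members = [v for v, g in zip(values, groups) if g == level]
        level_mean = sum(members) / len(members)
        ss_effect += len(members) * (level_mean - grand_mean) ** 2
    df_effect = len(levels) - 1
    df_error = n - len(levels)
    ms_error = (ss_total - ss_effect) / df_error   # residual mean square
    return (ss_effect - df_effect * ms_error) / (ms_error + ss_total)

# Hypothetical normalized intensities for one gene in two collection media.
intensity = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
media = ["TRIzol"] * 3 + ["RNAProtect"] * 3
w2 = omega_squared(intensity, media)   # (24 - 1*1) / (1 + 28) = 23/29
```

Averaging such per-gene values over all genes that pass filtering would give the average effect size discussed above.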

3. Technical Factor Removal

PCA and variance components can be used to assess the magnitude and significance of the technical variability in the data relative to the biological signal. If technical sources of variability are deemed to require removal, the regression method can be used to remove their effect.

(a) Details on the regression method: In a supervised setting, this method can be used to adjust the probe intensities for variation due to technical reasons (e.g., sample collection media) in the presence of the primary variable of interest (the disease label). Adjustments for technical factors can be made both in gene/feature selection, as well as in feature adjustment necessary for correct classification. For example:

(I) Feature Selection Linear Model:

$$E(y) = \beta_0 + \beta_1\,\mathrm{BM} + \beta_2\,\mathrm{TF1} + \beta_3\,\mathrm{BM} \cdot \mathrm{TF2} + \dots + \varepsilon$$

where TF1 is technical factor 1; and BM is the variable which contains the label ‘B’ or ‘M’. The current call to LIMMA for feature selection would be extended to support the adjustment by technical factors (up to 3) and corresponding 2-way interaction terms with the BM variable, if needed.

(II) Feature Adjustment: The features themselves can be adjusted in the following way:

$$Y - \hat{Y} = Y - X\hat{\beta}$$

where β̂ are the estimated coefficients from the terms in the feature selection linear model equation which involve technical factors, so that Ŷ = Xβ̂ is the fitted technical contribution that is subtracted from the features. In some instances, the model matrix X will contain only the variables containing the technical factor and will not contain the column of 1's (the intercept term).
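By way of illustration only, the supervised regression adjustment can be sketched with ordinary least squares. The sketch fits, for every gene, a model containing an intercept, the BM label, and one technical factor, and then subtracts only the technical-factor term, using a model matrix without the intercept column. It is an assumption-laden stand-in: it uses `numpy.linalg.lstsq` rather than LIMMA, omits interaction terms, and the function name `remove_technical_factor` and the noise-free toy data are hypothetical.

```python
import numpy as np

def remove_technical_factor(Y, bm, tf):
    """Fit y ~ 1 + BM + TF for each gene and subtract only the TF contribution."""
    bm = np.asarray(bm, dtype=float)
    tf = np.asarray(tf, dtype=float)
    design = np.column_stack([np.ones_like(bm), bm, tf])
    beta, *_ = np.linalg.lstsq(design, Y, rcond=None)   # shape (3, n_genes)
    X_tf = tf[:, None]              # technical columns only, no intercept
    return Y - X_tf @ beta[2:3, :]  # disease-label effect is left in place

# Hypothetical design: 8 samples, 3 genes.
bm = np.array([0, 0, 1, 1, 0, 0, 1, 1])   # 0 = benign, 1 = malignant
tf = np.array([0, 1, 0, 1, 0, 1, 0, 1])   # 0/1 = two collection media
signal = 2.0 * bm[:, None] * np.ones((1, 3))
Y = 5.0 + signal + 1.5 * tf[:, None]      # media adds a shift of 1.5
Y_adj = remove_technical_factor(Y, bm, tf)
```

In this toy case the media shift is removed while the benign versus malignant difference is preserved, which is the purpose of fitting the technical factor in the presence of the disease label.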

In unsupervised correction, the technical factor (TF) covariate can be used to shift the means between samples of one type (e.g., banked FNA) and those of another (e.g., prospective FNA). A boxplot of all the probe intensities for each sample will show whether such a “shift in means” exists due to known factors of technical variation.

In some embodiments, the regression method is applied in an unsupervised setting only if the technical source of variability is simply a global “shift in means” or linear and is not confounded by disease subtype. This would be an unsupervised correction, i.e., no disease labels will be used in the correction step.
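By way of illustration only, an unsupervised “shift in means” correction can be sketched as follows. For each probe, the mean of every sample type is aligned to the mean of a chosen reference type, and no disease labels are consulted. The function name `shift_means`, the choice of a per-probe offset, and the toy values are assumptions made for this example.

```python
def shift_means(samples, types, reference):
    """Shift each sample type so its per-probe mean matches the reference type."""
    n_probes = len(samples[0])

    def probe_means(kind):
        rows = [s for s, t in zip(samples, types) if t == kind]
        return [sum(r[j] for r in rows) / len(rows) for j in range(n_probes)]

    ref_means = probe_means(reference)
    offsets = {}
    for kind in set(types):
        means = probe_means(kind)
        offsets[kind] = [m - r for m, r in zip(means, ref_means)]
    # Subtract the type-specific offset from every sample of that type.
    return [[x - o for x, o in zip(s, offsets[t])] for s, t in zip(samples, types)]

# Hypothetical intensities: 4 samples, 2 probes.
samples = [[1.0, 2.0], [3.0, 4.0], [6.0, 8.0], [8.0, 10.0]]
types = ["banked", "banked", "prospective", "prospective"]
corrected = shift_means(samples, types, reference="banked")
```

Such a correction is only appropriate under the conditions stated above, since subtracting a type-level offset would also remove biological signal if sample type were confounded with disease subtype.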

In some embodiments, if evidence of technical variability is present in the data, but biological signal overwhelms it, no correction is applied to the data sets. A list of co-variables that can be examined by the subject algorithm is shown in Table 7.

TABLE 7. Technical factors or variables considered in the algorithm

Variable | Values
Collection source | OR vs. Clinic
Collection method | Banked FNA vs. Prospective FNA
Collection media | Trizol vs. RNAProtect
RNA RIN | Continuous
WTA yield | Continuous
ST yield | Continuous
Hybridization site | Laboratory 1 vs. Laboratory 2
Hybridization quality (AUC) | Continuous
General pathology | Benign vs. Malignant
Subtype pathology | LCT, NHP, FA, HA, FC, FVPTC, PTC, MTC
Experiment batch | FNA TRIzol 1-4 vs. FNA RNAprotect 1-4 or FNA TRIzol vs. FNA RNAprotect
Lab contamination | Dominant peak, band seen, both

Classification accuracy, sensitivity, specificity, ROC curves, error vs. number of markers curves, positive predictive value (PPV) and negative predictive value (NPV) can be reported using these approaches. The methods of the present invention have the sensitivity required to detect rare transcripts, which are expressed at a few copies per cell, and to reproducibly detect at least approximately two-fold differences in the expression levels. In some embodiments, the subject methods provide a high sensitivity of detecting gene expression and therefore detecting a genetic disorder or cancer that is greater than 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 95.5%, 96%, 96.5%, 97%, 97.5%, 98%, 98.5%, 99%, 99.5% or more. Therefore, the sensitivity of detecting and classifying a genetic disorder or cancer is increased. The classification accuracy of the subject methods in classifying genetic disorders or cancers can be greater than 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 95.5%, 96%, 96.5%, 97%, 97.5%, 98%, 98.5%, 99%, 99.5% or more. In some embodiments, the subject methods provide a high specificity of detecting and classifying gene expression that is greater than, for example, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 95.5%, 96%, 96.5%, 97%, 97.5%, 98%, 98.5%, 99%, 99.5% or more. In some embodiments, the nominal specificity is greater than or equal to 70%. The nominal negative predictive value (NPV) is greater than or equal to 95%. In some embodiments, the NPV is about 95%, 95.5%, 96%, 96.5%, 97%, 97.5%, 98%, 98.5%, 99%, 99.5% or more.
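By way of illustration only, the performance measures named above can be computed from reference labels and classifier calls as follows. The function name `diagnostic_metrics`, the label coding (‘M’ as the positive class), and the ten example calls are hypothetical and do not represent results of the subject methods.

```python
def diagnostic_metrics(truth, calls, positive="M"):
    """Summarize classifier calls against reference labels.

    Assumes every denominator is non-zero, i.e. both classes and both
    kinds of call occur at least once.
    """
    pairs = list(zip(truth, calls))
    tp = sum(1 for t, c in pairs if t == positive and c == positive)
    fn = sum(1 for t, c in pairs if t == positive and c != positive)
    tn = sum(1 for t, c in pairs if t != positive and c != positive)
    fp = sum(1 for t, c in pairs if t != positive and c == positive)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / len(pairs),
    }

# Hypothetical reference labels and calls for ten samples.
truth = ["M"] * 4 + ["B"] * 6
calls = ["M", "M", "M", "B", "B", "B", "B", "B", "B", "M"]
metrics = diagnostic_metrics(truth, calls)
```

Unlike sensitivity and specificity, PPV and NPV depend on the proportion of malignant samples in the evaluated set, so they are best reported together with that proportion.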

In some embodiments, feature selection (gene selection) can also be combined with technical factor removal. Often a variable, such as the sample collection media, provides a distinct shift at the intensity level. If the variable is associated with the disease subtype (Benign vs. Malignant), then feature selection can account for this variable (a confounder) in the regression model (LIMMA). The details of accounting for technical factor effects on the intensities in feature selection are described in Section 3 above. Repeatable feature selection can be carried out together with correction/removal of relevant technical factors present in the data set for each regression model applied to the genes.

In cross-validation mode at least two methods can be used, either one at a time, or in succession. The cross-validation methods are K-fold cross-validation and leave-one-out cross-validation (LOOCV). In one embodiment, feature selection (with or without technical factor removal) is incorporated within each loop of cross-validation. This enables the user to obtain unbiased estimates of error rates. Further, feature selection can also be performed within certain classifiers (e.g., random forest) in a multivariate setting. The classifier takes the features selected and the previously built training-set model and makes a classification call (Benign or not Benign) on the test set. This procedure of repeatedly splitting the data into training and test sets and providing a single averaged error rate at the end gives an unbiased error rate in the cross-validation mode.
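By way of illustration only, the following sketch shows feature selection performed inside each cross-validation loop, so that the held-out samples never influence which features are chosen. It is a simplified stand-in under stated assumptions: features are ranked by the absolute difference of class means rather than by LIMMA, a nearest-centroid rule replaces the classifier, folds are assigned by sample index, and all function names and data values are hypothetical.

```python
def select_features(X, y, k):
    """Rank features by absolute difference of class means (training data only)."""
    def class_mean(label, j):
        vals = [row[j] for row, lab in zip(X, y) if lab == label]
        return sum(vals) / len(vals)
    scores = [abs(class_mean("B", j) - class_mean("M", j)) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]

def centroid_predict(X_train, y_train, features, row):
    """Assign the label whose training centroid is closest on the selected features."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y_train)):
        members = [r for r, lab in zip(X_train, y_train) if lab == label]
        dist = 0.0
        for j in features:
            centroid = sum(r[j] for r in members) / len(members)
            dist += (row[j] - centroid) ** 2
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def cross_validated_error(X, y, n_folds, k):
    """K-fold CV with feature selection repeated inside every fold."""
    errors = 0
    for fold in range(n_folds):
        test_idx = [i for i in range(len(X)) if i % n_folds == fold]
        train_idx = [i for i in range(len(X)) if i % n_folds != fold]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        features = select_features(X_train, y_train, k)   # training fold only
        for i in test_idx:
            if centroid_predict(X_train, y_train, features, X[i]) != y[i]:
                errors += 1
    return errors / len(X)

# Hypothetical data: feature 0 separates the classes, features 1 and 2 do not.
X = [[0.1, 1.0, 3.0], [5.2, 1.1, 2.9], [-0.2, 0.9, 3.1], [4.9, 1.0, 3.0],
     [0.0, 1.2, 2.8], [5.1, 0.8, 3.2], [0.3, 1.0, 3.0], [4.8, 1.1, 3.1]]
y = ["B", "M", "B", "M", "B", "M", "B", "M"]
kfold_error = cross_validated_error(X, y, n_folds=4, k=1)
loocv_error = cross_validated_error(X, y, n_folds=len(X), k=1)
```

Setting the number of folds equal to the number of samples yields leave-one-out cross-validation, so the same routine covers both modes named above.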

Training and validation data sets can be normalized and processed together (APT, RMA with quantile normalization), including removal of technical factors when necessary. The data can be split into training and validation sets based on specific criteria (e.g., balancing each set by relevant covariate levels).
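By way of illustration only, one way to balance training and validation sets by a covariate is to split within each covariate level. The sketch below assigns every third sample of each level to validation; the function name `balanced_split`, the one-in-three ratio, and the sample identifiers are assumptions made for this example.

```python
from collections import defaultdict

def balanced_split(sample_ids, covariate, every=3):
    """Send every `every`-th sample within each covariate level to validation."""
    by_level = defaultdict(list)
    for sid, level in zip(sample_ids, covariate):
        by_level[level].append(sid)
    train, validation = [], []
    for level in sorted(by_level):
        for position, sid in enumerate(by_level[level]):
            if position % every == every - 1:
                validation.append(sid)
            else:
                train.append(sid)
    return train, validation

# Hypothetical samples: six collected in each medium.
ids = ["s%d" % i for i in range(1, 13)]
media = ["TRIzol"] * 6 + ["RNAProtect"] * 6
train, validation = balanced_split(ids, media)
```

Because the split is made level by level, both sets keep the same proportion of each collection medium.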

In several embodiments, algorithm training can be conducted on a sample of surgical tissue, FNAs collected in TRIzol, and/or FNAs collected in RNAProtect. In testing of algorithm performance, approximately 90% non-benign percent agreement (aka sensitivity) and 93% benign agreement (aka specificity) were obtained on a select set of samples that pass certain pre- and post-chip metrics.

Training of the algorithm can include feature (i.e., gene) selection. Each round of training can result in a de novo set of markers, for example, 5, 10, 25, 50, 100, 200, 300 or 500 markers. In one example, comparison of marker lists across the three key discovery training sets (surgical tissue, FNAs collected in TRIzol, and FNAs collected in RNAProtect) revealed a total of 338 non-redundant markers; of these, 158 markers are in all three marker lists.

In some embodiments, the exon array platform used in the present invention measures mRNA levels of all known human genes (24,000) and all known transcripts (>200,000). This array is used on every sample run in feasibility (i.e., gene discovery); therefore, the algorithm is trained on the full complement of genes at every step. Throughout algorithm training, feature (i.e., gene) selection occurs de novo for every experimental set. Thus, features may be selected from multiple experiments and later combined.

Marker panels can be chosen to accommodate adequate separation of benign from non-benign expression profiles. Training of this multi-dimensional classifier, i.e., algorithm, was performed on over 500 thyroid samples, including >300 thyroid FNAs. Many training/test sets were used to develop the preliminary algorithm. First the overall algorithm error rate is shown as a function of gene number for benign vs. non-benign samples. All results are obtained using a support vector machine model which is trained and tested in a cross-validated mode (30-fold) on the samples.

In some embodiments, the difference in gene expression level is at least 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45% or 50% or more. In some embodiments, the difference in gene expression level is at least 2, 3, 4, 5, 6, 7, 8, 9, 10 fold or more. In some embodiments, the biological sample is identified as cancerous with an accuracy of greater than 75%, 80%, 85%, 90%, 95%, 99% or more. In some embodiments, the biological sample is identified as cancerous with a sensitivity of greater than 95%. In some embodiments, the biological sample is identified as cancerous with a specificity of greater than 95%. In some embodiments, the biological sample is identified as cancerous with a sensitivity of greater than 95% and a specificity of greater than 95%. In some embodiments, the accuracy is calculated using a trained algorithm.

In some embodiments of the present invention, results are classified using a trained algorithm. Trained algorithms of the present invention include algorithms that have been developed using a reference set of known malignant, benign, and normal samples. The classification scheme using the algorithms of the present invention is shown in FIG. 23. Algorithms suitable for categorization of samples include but are not limited to k-nearest neighbor algorithms, concept vector algorithms, naive Bayesian algorithms, neural network algorithms, hidden Markov model algorithms, genetic algorithms, and mutual information feature selection algorithms or any combination thereof. In some cases, trained algorithms of the present invention may incorporate data other than gene expression or alternative splicing data such as but not limited to scoring or diagnosis by cytologists or pathologists of the present invention, information provided by the pre-classifier algorithm of the present invention, or information about the medical history of the subject of the present invention.

While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.

1.-35. (canceled)
36. A method for processing or analyzing a sample of tissue of a subject, comprising: (a) processing said sample of tissue to yield a data set including data corresponding to levels of gene expression products in said sample of tissue, which sample is indeterminate when subjected to cytological analysis; (b) inputting said data from (a) to a trained algorithm in a programmed computer to generate a classification of said sample of tissue as positive or negative for a disease at an accuracy of at least 90%, wherein said trained algorithm is trained with a plurality of training samples that is independent of said sample of tissue; and (c) electronically outputting a report that identifies said classification of said sample of tissue as positive or negative for said disease.
37. The method of claim 36, wherein said disease is cancer.
38. The method of claim 37, wherein said cancer is thyroid cancer, lung cancer, adrenal cortical cancer, anal cancer, bile duct cancer, bladder cancer, bone cancer, breast cancer, cervical cancer, colorectal cancer, endometrial cancer, esophagus cancer, eye cancer, gallbladder cancer, gastrointestinal carcinoid tumors, gastrointestinal stromal tumors, kidney cancer, acute lymphocytic leukemia, acute myeloid leukemia, liver cancer, Non-Hodgkin's lymphoma, multiple myeloma, nasopharyngeal cancer, neuroblastoma, oropharyngeal cancer, osteosarcoma, ovarian cancer, pancreatic cancer, pituitary tumor, prostate cancer, retinoblastoma, melanoma, stomach cancer, testicular cancer, thymus cancer, or uterine cancer.
39. The method of claim 37, wherein said cancer is thyroid cancer.
40. The method of claim 37, wherein said cancer is lung cancer.
41. The method of claim 37, wherein said cancer is colorectal cancer.
42. The method of claim 37, wherein said cancer is prostate cancer.
43. The method of claim 36, wherein said disease is a hyperproliferative disorder.
44. The method of claim 36, wherein said sample of tissue comprises buccal tissue, skin tissue, heart tissue, lung tissue, kidney tissue, breast tissue, pancreas tissue, liver tissue, muscle tissue, smooth muscle tissue, bladder tissue, gall bladder tissue, colon tissue, intestine tissue, brain tissue, prostate tissue, or esophagus tissue.
45. The method of claim 36, wherein said sample of tissue is obtained by swabbing.
46. The method of claim 36, wherein said sample of tissue comprises two or more tissue types.
47. The method of claim 46, wherein a first portion of said sample of tissue comprises a buccal tissue.
48. The method of claim 36, wherein said sample of tissue comprises a blood sample.
49. The method of claim 36, wherein said data comprises ribonucleic acid (RNA) data.
50. The method of claim 49, wherein said RNA data comprises micro RNA data.
51. The method of claim 36, wherein said data comprises deoxyribonucleic acid (DNA) data.
52. The method of claim 51, wherein said processing comprises identifying a copy number variation or a variant in said DNA data.
53. The method of claim 36, wherein said processing comprises assaying a portion of said sample of tissue by sequencing, array hybridization or nucleic acid amplification.
54. The method of claim 36, wherein said plurality of training samples comprises a metastatic melanoma sample, a metastatic renal carcinoma sample, a metastatic breast carcinoma sample, a metastatic B cell lymphoma sample, or any combination thereof.
55. The method of claim 36, wherein said plurality of training samples comprises a normal tissue sample and a plurality of samples having different tissue pathologies.
56. The method of claim 36, wherein said trained algorithm generates said classification at a specificity of at least about 90%.
57. The method of claim 36, wherein said trained algorithm generates said classification at a sensitivity of at least about 80%.
58. The method of claim 36, wherein said sample of tissue has a benign condition, and wherein said trained algorithm does not classify said sample of tissue as positive for said disease.
59. The method of claim 36, wherein said sample of tissue has a malignant condition, and wherein said trained algorithm classifies said sample of tissue as positive for said disease.